Quickly read very large tables as data frames

Keywords: Hadoop SQLite SQL Database

I have a very large table (30 million rows) that I would like to load as a data frame in R. read.table() has a lot of convenient features, but it seems there is a lot of logic in the implementation that slows things down. In my case, I am assuming I know the column types ahead of time, the table does not contain any column headers or row names, and there are no pathological characters I have to worry about.

I know that reading a table in as a list using scan() can be quite fast, for example:

datalist <- scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0))

But some of my attempts to convert this into a data frame appear to decrease the performance above by a factor of six:

df <- as.data.frame(scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0)))

Is there a better way of doing this? Or a completely different approach to the problem?

Answer #1

Oddly, even though this is an important question, no one answered the bottom part of it for years: a data.frame is simply a list with the right attributes, so if you have large data you don't want to use as.data.frame() or similar on the list. It is much faster to simply convert the list into a data frame "in place":

attr(df, "row.names") <- .set_row_names(length(df[[1]]))
class(df) <- "data.frame"

This makes no copy of the data, so it is instantaneous (unlike all the other methods). It assumes that you have already set the corresponding names() on the list.
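
As a rough sketch of how this fits together with the scan() call from the question (the path and column layout come from the question, not from this answer):

datalist <- scan('myfile', sep = '\t',
                 what = list(url = '', popularity = 0, mintime = 0, maxtime = 0))
df <- datalist                                            # names() are already set by scan()
attr(df, "row.names") <- .set_row_names(length(df[[1]]))  # mark the row count
class(df) <- "data.frame"                                 # no copy of the column data is made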

[As for loading big data into R: personally, I dump it by column into binary files and use readBin() - that is by far the fastest method (other than mmapping) and is limited only by disk speed. Parsing ASCII files is inherently slow (even in C) compared to binary data.]
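
A minimal sketch of that column-dump idea, assuming a single numeric column and a hypothetical file name (the answer itself gives no code for this):

x <- rnorm(1e6)                                    # stand-in for one numeric column
writeBin(x, "popularity.bin")                      # stored as raw 8-byte doubles
n <- file.info("popularity.bin")$size / 8          # how many doubles are in the file
y <- readBin("popularity.bin", what = "double", n = n)
stopifnot(length(y) == length(x))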

Answer #2

Here is an example using fread from data.table 1.8.7.

These examples come from the help page for fread; the timings are from my Windows XP Core 2 Duo E8400.

library(data.table)
# Demo speedup
n=1e6
DT = data.table( a=sample(1:1000,n,replace=TRUE),
                 b=sample(1:1000,n,replace=TRUE),
                 c=rnorm(n),
                 d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
                 e=rnorm(n),
                 f=sample(1:1000,n,replace=TRUE) )
DT[2,b:=NA_integer_]
DT[4,c:=NA_real_]
DT[3,d:=NA_character_]
DT[5,d:=""]
DT[2,e:=+Inf]
DT[3,e:=-Inf]

Standard read.csv

write.table(DT,"test.csv",sep=",",row.names=FALSE,quote=FALSE)
cat("File size (MB):",round(file.info("test.csv")$size/1024^2),"\n")    
## File size (MB): 51 

system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))        
##    user  system elapsed 
##   24.71    0.15   25.42
# second run will be faster
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))        
##    user  system elapsed 
##   17.85    0.07   17.98

Optimized read.table

system.time(DF2 <- read.table("test.csv",header=TRUE,sep=",",quote="",  
                          stringsAsFactors=FALSE,comment.char="",nrows=n,                   
                          colClasses=c("integer","integer","numeric",                        
                                       "character","numeric","integer")))


##    user  system elapsed 
##   10.20    0.03   10.32

fread

require(data.table)
system.time(DT <- fread("test.csv"))                                  
##    user  system elapsed 
##    3.12    0.01    3.22

sqldf

require(sqldf)

system.time(SQLDF <- read.csv.sql("test.csv",dbname=NULL))             

##    user  system elapsed 
##   12.49    0.09   12.69

# sqldf as on SO

f <- file("test.csv")
system.time(SQLf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))

##    user  system elapsed 
##   10.21    0.47   10.73

ff / ffdf

require(ff)

system.time(FFDF <- read.csv.ffdf(file="test.csv",nrows=n))   
##    user  system elapsed 
##   10.85    0.10   10.99

In summary:

##    user  system elapsed  Method
##   24.71    0.15   25.42  read.csv (first time)
##   17.85    0.07   17.98  read.csv (second time)
##   10.20    0.03   10.32  Optimized read.table
##    3.12    0.01    3.22  fread
##   12.49    0.09   12.69  sqldf
##   10.21    0.47   10.73  sqldf on SO
##   10.85    0.10   10.99  ffdf

Answer #3

Update, several years later

This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:

  1. Use fread from data.table to import data from csv/tab-delimited files directly into R. See mnel's answer.

  2. Use read_table in readr (on CRAN since April 2015). It works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread). A short sketch of both approaches follows this list.

  3. read.csv.raw from iotools provides a third option for quickly reading CSV files.

  4. Try to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also the RODBC package, and the reverse-depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, which improves performance; import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.

  5. Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
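
A minimal sketch of options 1 and 2, assuming the six-column test.csv from the benchmark above (the readr call uses read_csv because the file is comma-separated; read_table is its whitespace-separated counterpart):

library(data.table)
library(readr)

DT  <- fread("test.csv")                 # returns a data.table (also a data.frame)
tbl <- read_csv("test.csv")              # returns a tibble; column types are guessed

# Both can also be given explicit column types:
DT2  <- fread("test.csv", colClasses = c("integer", "integer", "numeric",
                                         "character", "numeric", "integer"))
tbl2 <- read_csv("test.csv", col_types = "iidcdi")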

Original answer

There are a couple of simple things to try, whether you use read.table or scan:

  1. Set nrows = the number of records in your data (nmax in scan()).

  2. Make sure that comment.char = "" to turn off interpretation of comments.

  3. Explicitly define the classes of each column using colClasses in read.table() (see the sketch after this list).

  4. Setting multi.line = FALSE may also improve performance in scan().
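
Applied to the file from the question, tips 1-3 might look like the sketch below (the path, row count, and column types are assumptions taken from the question, not from this answer):

df <- read.table("myfile", sep = "\t", header = FALSE,
                 nrows = 30e6,                       # tip 1: known number of records
                 comment.char = "",                  # tip 2: no comment scanning
                 colClasses = c("character", "numeric", "numeric", "numeric"),  # tip 3
                 col.names = c("url", "popularity", "mintime", "maxtime"),
                 quote = "", stringsAsFactors = FALSE)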

If none of these works, use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.

Another option is to filter the data before reading it into R.

Or, if the problem is that you have to read the data in regularly, use these methods to read it in once, then save the data frame as a binary blob with save/saveRDS; next time you can retrieve it faster with load/readRDS.
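
A small sketch of that read-once-then-cache pattern (file names and column types are hypothetical):

if (file.exists("mydata.rds")) {
  df <- readRDS("mydata.rds")            # fast binary reload on later runs
} else {
  df <- read.table("myfile", sep = "\t", header = FALSE,
                   colClasses = c("character", "numeric", "numeric", "numeric"))
  saveRDS(df, "mydata.rds")              # slow parse once, then cache the result
}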

Answer #4

This was asked once before on R-Help, so that thread is worth reviewing.

One suggestion there was to use readChar() and then do string manipulation on the result with strsplit() and substr(). You can see that the logic involved in readChar is much less than in read.table.
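
A rough illustration of that readChar()/strsplit() idea for the four-column layout in the question (the file name and column order are assumptions):

sz    <- file.info("myfile")$size
txt   <- readChar("myfile", sz, useBytes = TRUE)    # slurp the whole file at once
rows  <- strsplit(txt, "\n", fixed = TRUE)[[1]]
parts <- strsplit(rows, "\t", fixed = TRUE)
df <- data.frame(url        = vapply(parts, `[`, "", 1),
                 popularity = as.numeric(vapply(parts, `[`, "", 2)),
                 mintime    = as.numeric(vapply(parts, `[`, "", 3)),
                 maxtime    = as.numeric(vapply(parts, `[`, "", 4)),
                 stringsAsFactors = FALSE)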

I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. It uses Hadoop, a MapReduce framework designed for dealing with large data sets. For this, you would use the hsTableReader function. This is an example (but it has a learning curve for learning Hadoop):

str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\nkey2\"
cat(str)
cols = list(key='',val=0)
con <- textConnection(str, open = "r")
hsTableReader(con,cols,chunkSize=6,FUN=print,ignoreKey=TRUE)
close(con)

The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (such as snow) and run the data import in parallel by segmenting the file, but for large data sets that probably won't help, since you will run into memory constraints, which is why map-reduce is a better approach.
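
For comparison, chunked import can also be done in plain base R with a connection, without Hadoop; a minimal sketch (the chunk size, path, and column types are assumptions):

con <- file("myfile", open = "r")
chunks <- list()
repeat {
  chunk <- tryCatch(
    read.table(con, sep = "\t", nrows = 1e6, header = FALSE,
               colClasses = c("character", "numeric", "numeric", "numeric"),
               comment.char = "", quote = ""),
    error = function(e) NULL)              # read.table errors once the input is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  chunks[[length(chunks) + 1]] <- chunk    # each chunk could instead be processed and discarded
}
close(con)
df <- do.call(rbind, chunks)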

Answer #5

I didn't see this question initially, and I asked a similar question a few days later. I am going to take my earlier question down, but I thought I'd add an answer here to explain how I did this using sqldf().

There has been relatively little discussion about the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me: I was able to pull in 2GB (3 columns, 40mm rows) of data in under 5 minutes. By contrast, the read.csv command ran all night and never completed.

This is my test code:

Set up the test data:

bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)

I restarted R before running the following import routine:

library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))

I let the following line run all night, but it never completed:

system.time(big.df <- read.csv('bigdf.csv'))

Posted by Johnlbuk on Fri, 17 Jan 2020 07:09:11 -0800