Importing very large dataset (5,000,000 entries) into R.

Hello everyone,

So I'm really at my wit's end here. I have a large dataset that I'm trying to import into R, but my computer takes hours trying to read it in before running out of memory and failing to process it. The file is an ndjson file containing a set of Yelp reviews from their dataset challenge. As of now I have tried the following:

  1. Using jsonlite's stream_in function to process it. This works until around 3 million records, when it slows down drastically, and it gives up at 4 million

  2. The same option but with pagesize = 100k or a million - pretty much the same thing since the problem is the overall memory

  3. Using the ndjson package, which is supposed to be faster and more efficient, but it was basically the same: it sat there for 2 hours before saying it ran out of memory as well

  4. Using the readLines command to break the dataset down into smaller sections - I tried subsetting a million entries but it's still processing (about an hour now). I can try a smaller number, but I feel like it won't be viable.

FYI, my computer is running 64-bit R and has 8 GB of RAM. Thanks in advance!

You have two separate problems: memory and speed. The memory (RAM) of your computer is fixed, and apparently you don't have enough to store all the records at once. There is nothing you can do about that, so methods 1, 2 and 3 won't work no matter what package you use.

(Small technical note: there could be a way. Let's say you can use 6 GB for R and you have 5,000,000 records; that still gives a bit over 1 KB per record, so perhaps there would be a way to reduce the amount of data in one record when loading it, so that each one fits in about 1 KB, which is roughly the size of 1,000 characters of text.)
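
As a hedged illustration of that note (the column names are assumed from the review schema and the pagesize is arbitrary), you could stream the file and keep only the small fields, dropping the large free-text field:

library(jsonlite)

slim_pages <- list()

stream_in(
  file("yelp_academic_dataset_review.json"),
  handler = function(page) {
    # keep only the small columns; the free-text 'text' field is what takes up space
    slim_pages[[length(slim_pages) + 1]] <<- page[, c("business_id", "stars", "date")]
  },
  pagesize = 10000
)

slim_reviews <- do.call(rbind, slim_pages)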

So, I see three possibilities. One is to use a database for on-disk storage, and use dbplyr or SQL to perform operations on the database. That could work well with numeric columns and standard dplyr operations; I'm not sure it would go as well with text mining.
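
For instance, here is a minimal sketch of that route, assuming DBI, RSQLite and dbplyr are installed and that the records are flat enough to write directly to a table (file and table names are placeholders):

library(DBI)
library(RSQLite)
library(dplyr)
library(jsonlite)

con <- dbConnect(SQLite(), "yelp_reviews.sqlite")

# stream_in() calls the handler on each page of parsed records, so nothing
# accumulates in RAM: each page is appended to the on-disk table and discarded
stream_in(
  file("yelp_academic_dataset_review.json"),
  handler = function(page) dbWriteTable(con, "reviews", page, append = TRUE),
  pagesize = 10000
)

# dbplyr translates this to SQL and runs it in the database, not in memory
tbl(con, "reviews") %>%
  count(stars)

dbDisconnect(con)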

Another possibility is your approach 4: to process the input in chunks. The question then is processing speed. I would recommend that you run your readLines() and processing on sections of 10, 50, 100, 500, 1000, 5000 and 10,000 records (or until it takes too long), and plot how the processing time depends on the number of records (see the timing sketch after this list). That gives you three things.

  1. First, that gives you an estimate of how long it takes for a given number of records.
  2. Second, it gives you an idea of how efficient your code is. If time and number of records are proportional (doubling the number of records doubles the processing time), it means your code is basically performing one operation on each record. But if the time increases faster than the number of records (for example doubling the number of records multiplies the processing time by 4), then there could be something to improve (that suggests you're using nested loops, and for each record you read, you re-read all the other records). Of course that could also be necessary for your analysis.
  3. Third, you can identify bottlenecks. For a small number of records (a number that takes a few seconds to run), you can run code profiling and see which step is the longest. There is a good chance that a single step is particularly slow, and that you can do something about it and make the whole process a lot faster.
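
As a rough sketch of that timing exercise (the file path and the jsonlite parsing step are just stand-ins for your own processing):

sizes <- c(10, 50, 100, 500, 1000, 5000, 10000)

elapsed <- sapply(sizes, function(n) {
  system.time({
    lines <- readLines("yelp_academic_dataset_review.json", n = n)
    parsed <- lapply(lines, jsonlite::fromJSON)  # replace with your own processing
  })["elapsed"]
})

# if the points fall on a straight line with slope 1 on a log-log plot,
# the time grows proportionally to the number of records
plot(sizes, elapsed, type = "b", log = "xy",
     xlab = "number of records", ylab = "elapsed time (seconds)")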

And finally, another possibility: for many analyses it makes sense to just take a subsample of the original data and run your code on that. Perhaps you don't need millions of records for your question anyway.
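
If you go that route, one hedged sketch (the block size and the 1% sampling rate are arbitrary) is to read the file through an open connection in blocks and keep a random subset of lines, so the full file never has to sit in memory:

con <- file("yelp_academic_dataset_review.json", open = "r")
sampled_lines <- character(0)

repeat {
  block <- readLines(con, n = 100000)
  if (length(block) == 0) break
  # keep roughly 1% of the lines in each block
  sampled_lines <- c(sampled_lines, block[runif(length(block)) < 0.01])
}
close(con)

# the ~50,000 sampled records can then be parsed normally
reviews_sample <- purrr::map_dfr(sampled_lines, jsonlite::fromJSON)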

Of course, what is the best solution and what you can improve depends a lot on your data and approach.


Hi @AlexisW, Thank you for your reply!

I've tried doing a readLines and it does work for the first 500k entries without a problem. The question now is: how do I read the subsequent batches of lines, from 500k to 1 million and so on? Looking at the readLines documentation, you can only read the first x lines (or the first total-minus-x lines), not a block starting partway through the file. Once I have the data broken into data frames of 500k it's not a problem, but how do I make the next few? Thanks!

Does readr::read_lines_chunked() do what you need?

The other possibility I could think of is to use the command line, and run sed to extract blocks that you can pipe into an R script. There are probably variations that you could come up with if needed.
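
One hedged way to do that from inside R, assuming sed is available on your system: use pipe() so sed extracts a block of lines (here lines 500,001 to 1,000,000) that readLines then ingests.

second_batch <- readLines(
  pipe("sed -n '500001,1000000p' yelp_academic_dataset_review.json")
)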

Hi Alexis, this seems very promising; however, I seem to be having problems with the callback function (sorry, rather new to R). This is what I have so far:

reviews <-read_lines_chunked("D:\R files\yelp_academic_dataset_review.json",stream_in(textConnection(), chunk_size = 100000)

However I get an error: Error: Argument 'con' must be a connection.

How should I be formatting this so that each chunk gets turned into an individual data frame?

Thanks!

The stream_in() function is from which package? ndjson? Could you share an example ndjson file to test with?

If it's jsonlite, you can also try fromJSON() directly, which takes text rather than a connection (a connection is just a fancy name for a file). In the ndjson package, I think flatten() can handle text.

Also, in that command you have a problem with the parentheses. Make sure chunk_size is an argument of read_lines_chunked(), not of the json reader (the json reader should read everything that arrives).

It might be easier to separate the callback as a full function:

my_callback_function <- function(single_line){
  # parse one ndjson line into a one-row data frame
  one_record <- jsonlite::fromJSON(single_line, simplifyDataFrame = TRUE, flatten = TRUE)
  # process() is a placeholder for whatever you want to do with that record
  process(one_record)
}

(probably needs a bit of trial and error)

The stream_in function is the one from jsonlite itself (the same package as fromJSON). The file is an ndjson, so in order to process it with jsonlite you need to use stream_in(file("example.json")). An example of this would be the Yelp dataset here; I'm using the review dataset, but I'd say any of the smaller jsons would work for testing.

I've used stream_in(textConnection(readLines("example.json", n = 5000000))) and it worked for that, but the chunks aren't a connection, right? I tried fromJSON but it doesn't work because the text is unfortunately still in the ndjson format.

Playing a little bit with it on Kaggle (I somehow failed to download it), these two approaches seem to work:

raw_lines <- readr::read_lines("/kaggle/input/yelp-dataset/yelp_academic_dataset_checkin.json", n_max = 10)
length(raw_lines) # character vector with one element per json record
#> 10

# With ndjson: by default a data.table, can be loaded as a tibble if preferred
loaded_by_ndjson <- ndjson::flatten(raw_lines)

# With jsonlite, one record at a time
single_by_jsonlite <- jsonlite::fromJSON(raw_lines[[1]])

single_by_jsonlite$business_id == loaded_by_ndjson[1, 1]
#> TRUE
single_by_jsonlite$date == loaded_by_ndjson[1, 2]
#> TRUE

So you can use either ndjson::flatten() to read all the entries at once, or jsonlite::fromJSON() to read the entries one by one, which means you will need to loop over the entries one way or another. For example:

mapped_jsonlite <- purrr::map_df(raw_lines, jsonlite::fromJSON)

In this case you don't want to use stream_in(), since you don't have a connection.

Hi,

so I've been tinkering a bit and I feel like I'm really close, but something isn't quite right:

g <- function(x, pos) {
  df <- purrr::map_df(x, jsonlite::fromJSON)
  assign(paste('r', pos/50000, sep = " "), df)
}
read_lines_chunked("D:\R files\yelp_academic_dataset_review.json",
                   g,
                   chunk_size = 500000)

This set of code actually reads through the whole dataset (finally), but it returns an output of NULL once it is done. As you can see, in the function I tried to make a data frame from the chunk it was working on (so that I can merge them together later), but the final result only returns NULL and no actual data frame is created.

I feel like this is a problem with the chunk callback function that I'm not getting. When I tried getting it to use DataFrameCallback$new, the console again overloaded since it was trying to store the read tables together before finishing. What should I put in to fix this? And given that this might be a slightly different topic from the original one, should I start a new thread?

Thanks and sorry for the many questions!

A function has its own environment (scope), separate from the global environment.
Your assignment assigns to that environment by default, since you don't specify the environment to assign to.
Use

assign(x = paste('r', pos/50000, sep = " "),
       value = df,
       envir = .GlobalEnv)

Now you're running back into the previous problem: you don't have enough memory to store all the records simultaneously. The idea of the callback function is that it does processing and then only returns a result and discards all the data, freeing up memory for the next chunk. The callback function is a bit hard to define however, because it requires a specific formulation.

Example callbacks

For example, using a ListCallback you can get a list where each element is the result of processing of one chunk.

Here is an example where we want to sum up numeric values from a file:

library(readr)
library(purrr)

# Let's create an example file with random numbers
write_lines(rnorm(20, 20, 1), "some_numbers.txt")

# Whole processing by chunks in the classic way
all_values <- readLines("some_numbers.txt")

chunks <- list(1:5,6:10,11:15,16:20)
whole_res <- map(chunks, ~ sum(as.double(all_values[.x])))
whole_res
# [[1]]
# [1] 97.39217
# 
# [[2]]
# [1] 104.7825
# 
# [[3]]
# [1] 99.58542
# 
# [[4]]
# [1] 99.14727

# => each element is the sum of the values in the chunk

# Now let's use automated chunking
processing <- function(values,index){
  sum(as.double(values))
}

# Chunked processing
chunked_res <- read_lines_chunked("some_numbers.txt",
                   ListCallback$new(processing),
                   chunk_size = 5)

# the two approaches are equivalent
all.equal(whole_res, chunked_res)
#> [1] TRUE

If we want the results to be in a data frame instead, we can use DataFrameCallback:

processing <- function(values,index){
  c(sum = sum(as.double(values)))
}

# Chunked processing
chunked_res <- read_lines_chunked("some_numbers.txt",
                                  DataFrameCallback$new(processing),
                                  chunk_size = 5)
chunked_res
#            sum
# [1,]  97.39217
# [2,] 104.78252
# [3,]  99.58542
# [4,]  99.14727

And if we want the total sum we can use an AccumulateCallback (which uses the variable acc carried over from the previous chunk):

processing <- function(values,index, acc){
  acc + sum(as.double(values))
}

# Chunked processing
chunked_res <- read_lines_chunked("some_numbers.txt",
                                  AccumulateCallback$new(processing, acc=0),
                                  chunk_size = 5)
chunked_res
#> 400.9074

# standard sum on the whole file
sum(as.double(all_values))
#> 400.9074

For your specific dataset

You obviously can't load all the json files into memory at once, since taken together they add up to about 10 GB, which is more than your computer has. But if you only wanted to load "yelp_academic_dataset_checkin.json (428.83 MB)", you should be able to do it (provided you have no other big objects in your R session, and no other software such as a web browser using up all your RAM). In that case you could try something like:

read_json <- function(values, pos){
  map_dfr(values, ~ jsonlite::fromJSON(.x))
}
read_lines_chunked("exple2.ndjson.txt",
                   DataFrameCallback$new(read_json),
                   chunk_size = 5)
# A tibble: 10 x 2
#    business_id          date                                                      
#    <chr>                <chr>                                                     
#  1 --1UhMGODdWsrMastO9~ 2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016-10-15 02:4~
#  2 --6MefnULPED_I942Vc~ 2011-06-04 18:22:23, 2011-07-23 23:51:33, 2012-04-15 01:0~
#  3 --7zmmkVg-IMGaXbuVd~ 2014-12-29 19:25:50, 2015-01-17 01:49:14, 2015-01-24 20:3~
#  4 --8LPVSo5i0Oo61X01s~ 2016-07-08 16:43:30                                       
#  5 --9QQLMTbFzLJ_oT-ON~ 2010-06-26 17:39:07, 2010-08-01 20:06:21, 2010-12-09 21:2~
#  6 --9e1ONYQuAa-CB_Rrw~ 2010-02-08 05:56:47, 2010-02-15 04:47:42, 2010-02-22 03:2~
#  7 --DaPTJW3-tB1vP-Pfd~ 2012-06-03 17:46:09, 2012-08-04 16:19:52, 2012-08-04 16:2~
#  8 --DdmeR16TRb3LsjG0e~ 2012-11-02 21:26:42, 2012-11-02 22:30:43, 2012-11-02 22:4~
#  9 --EF5N7P70J_UYBTPyp~ 2018-05-25 19:52:07, 2018-09-18 16:09:44, 2019-10-18 21:2~
# 10 --EX4rRznJrltyn-34J~ 2010-02-26 17:05:40, 2012-12-29 20:05:04, 2012-12-30 22:0~

(this is for a manually copy-pasted ndjson of the first 10 lines of "yelp_academic_dataset_checkin")

Note, though, that while this works well for a smallish file, you can't do it for files that are too big, and it won't work for the whole dataset at once. What you need to do is use the callback function to compute some summary statistics, and only combine those while discarding the raw data. Which summary statistics make sense depends on your question.
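
For instance, a hedged sketch of that pattern (the column names are assumed from the review schema, and the summary itself is just an example):

library(readr)
library(purrr)

summarise_chunk <- function(lines, pos){
  reviews <- map_dfr(lines, jsonlite::fromJSON)
  # keep only a tiny per-chunk summary; the parsed reviews are then discarded
  c(n_reviews   = nrow(reviews),
    mean_stars  = mean(reviews$stars),
    total_chars = sum(nchar(reviews$text)))
}

chunk_summaries <- read_lines_chunked("yelp_academic_dataset_review.json",
                                      DataFrameCallback$new(summarise_chunk),
                                      chunk_size = 100000)

# combine the per-chunk summaries into overall figures
colSums(chunk_summaries[, c("n_reviews", "total_chars")])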
