Now you run into the previous problem again: you don't have enough memory to hold all the records at once. The idea of the callback function is that it processes one chunk, returns only a (small) result, and discards the raw data, freeing memory for the next chunk. The callback is a bit awkward to define, however, because it has to follow a specific formulation.
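To make that "specific formulation" concrete, here is a minimal sketch of the shape readr's chunked readers expect (my_callback is just an illustrative name):
# The callback receives the current chunk (here a character vector of lines)
# and the position of the first element in the file; only its return value is
# kept, and the raw chunk is discarded before the next one is read.
my_callback <- function(x, pos) {
  length(x)  # placeholder: return some small summary of the chunk
}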
Example callbacks
For example, using a ListCallback you can get a list where each element is the result of processing one chunk.
Here is an example where we want to sum up numeric values from a file:
library(readr)
library(purrr)

# Let's create an example file with 20 random numbers
write_lines(rnorm(20, 20, 1), "some_numbers.txt")
# Manual chunking, the classic way: read everything, then process 4 chunks of 5 lines
all_values <- readLines("some_numbers.txt")
chunks <- list(1:5, 6:10, 11:15, 16:20)
whole_res <- map(chunks, ~ sum(as.double(all_values[.x])))
whole_res
# [[1]]
# [1] 97.39217
#
# [[2]]
# [1] 104.7825
#
# [[3]]
# [1] 99.58542
#
# [[4]]
# [1] 99.14727
# => each element is the sum of the values in the chunk
# Now let's use automated chunking
# callback: sum the numeric values of one chunk (the index argument is unused here)
processing <- function(values, index) {
  sum(as.double(values))
}
# Chunked processing
chunked_res <- read_lines_chunked("some_numbers.txt",
                                  ListCallback$new(processing),
                                  chunk_size = 5)
# the two approaches are equivalent
all.equal(whole_res, chunked_res)
#> [1] TRUE
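The per-chunk results are plain R objects, so you can combine them afterwards however you like; for example, to get the grand total from the list above:
# combine the per-chunk sums into one overall total
sum(unlist(chunked_res))
#> [1] 400.9074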
If we want the results in a data frame instead, we can use a DataFrameCallback:
# callback: return a named vector; the per-chunk results get row-bound together
processing <- function(values, index) {
  c(sum = sum(as.double(values)))
}
# Chunked processing
chunked_res <- read_lines_chunked("some_numbers.txt",
                                  DataFrameCallback$new(processing),
                                  chunk_size = 5)
chunked_res
# sum
# [1,] 97.39217
# [2,] 104.78252
# [3,] 99.58542
# [4,] 99.14727
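If you need several statistics per chunk, the callback can also return a one-row data frame, and DataFrameCallback will row-bind them. A small sketch (the column names are just illustrative):
# sketch: several summaries per chunk, row-bound into one table
processing_multi <- function(values, index) {
  x <- as.double(values)
  data.frame(first_line = index, n = length(x), sum = sum(x), mean = mean(x))
}
read_lines_chunked("some_numbers.txt",
                   DataFrameCallback$new(processing_multi),
                   chunk_size = 5)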
And if we want the total sum we can use an AccumulateCallback (whose callback receives the accumulated value acc from the previous chunk):
# callback: add this chunk's sum to the running total carried in acc
processing <- function(values, index, acc) {
  acc + sum(as.double(values))
}
# Chunked processing
chunked_res <- read_lines_chunked("some_numbers.txt",
                                  AccumulateCallback$new(processing, acc = 0),
                                  chunk_size = 5)
chunked_res
#> 400.9074
# standard sum on the whole file
sum(as.double(all_values))
#> 400.9074
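If you need to carry more than one running quantity, the accumulator can be a list. Here is a sketch, on the same file as above, keeping a running count and total so a grand mean can be computed without ever holding all the values:
# sketch: accumulate a running count and a running total in a list
running_stats <- function(values, index, acc) {
  x <- as.double(values)
  list(n = acc$n + length(x), total = acc$total + sum(x))
}
res <- read_lines_chunked("some_numbers.txt",
                          AccumulateCallback$new(running_stats, acc = list(n = 0, total = 0)),
                          chunk_size = 5)
res$total / res$n  # grand mean of the whole file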
For your specific dataset
You obviously can't load all the JSON files into memory at once, since together they add up to 10 GB, which is more than your computer has. But if you only wanted to load "yelp_academic_dataset_checkin.json (428.83 MB)", you should be able to (provided you have no other big objects in your R session and no other software, such as a web browser, is using up all your RAM). In that case you could try something like:
# callback: parse each line as JSON and row-bind the results into a tibble
read_json <- function(values, pos) {
  map_dfr(values, ~ jsonlite::fromJSON(.x))
}
read_lines_chunked("exple2.ndjson.txt",
                   DataFrameCallback$new(read_json),
                   chunk_size = 5)
# A tibble: 10 x 2
# business_id date
# <chr> <chr>
# 1 --1UhMGODdWsrMastO9~ 2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016-10-15 02:4~
# 2 --6MefnULPED_I942Vc~ 2011-06-04 18:22:23, 2011-07-23 23:51:33, 2012-04-15 01:0~
# 3 --7zmmkVg-IMGaXbuVd~ 2014-12-29 19:25:50, 2015-01-17 01:49:14, 2015-01-24 20:3~
# 4 --8LPVSo5i0Oo61X01s~ 2016-07-08 16:43:30
# 5 --9QQLMTbFzLJ_oT-ON~ 2010-06-26 17:39:07, 2010-08-01 20:06:21, 2010-12-09 21:2~
# 6 --9e1ONYQuAa-CB_Rrw~ 2010-02-08 05:56:47, 2010-02-15 04:47:42, 2010-02-22 03:2~
# 7 --DaPTJW3-tB1vP-Pfd~ 2012-06-03 17:46:09, 2012-08-04 16:19:52, 2012-08-04 16:2~
# 8 --DdmeR16TRb3LsjG0e~ 2012-11-02 21:26:42, 2012-11-02 22:30:43, 2012-11-02 22:4~
# 9 --EF5N7P70J_UYBTPyp~ 2018-05-25 19:52:07, 2018-09-18 16:09:44, 2019-10-18 21:2~
# 10 --EX4rRznJrltyn-34J~ 2010-02-26 17:05:40, 2012-12-29 20:05:04, 2012-12-30 22:0~
(this output is for an ndjson file made by manually copy-pasting the first 10 lines of "yelp_academic_dataset_checkin")
Note, however, that while this works well for a smallish file, it won't work for files that are too big, and certainly not for the whole dataset at once. What you need to do is use the callback function to compute some summary statistics, keep only those, and discard the raw data. Which summary statistics make sense depends on your question.
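For instance, if the question were simply how many check-ins each business has, a sketch could look like the following (the chunk size is arbitrary, and counting the timestamps in the date field is just one possible summary):
library(readr)
library(purrr)
# keep only one small row per business instead of the parsed JSON itself
count_checkins <- function(lines, pos) {
  map_dfr(lines, function(line) {
    x <- jsonlite::fromJSON(line)
    data.frame(business_id = x$business_id,
               n_checkins  = length(strsplit(x$date, ", ", fixed = TRUE)[[1]]))
  })
}
checkin_counts <- read_lines_chunked("yelp_academic_dataset_checkin.json",
                                     DataFrameCallback$new(count_checkins),
                                     chunk_size = 10000)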