That error message looks like it is coming from the definition of `chunk_id` in your `mutate` statement. I will try to give an overview of my understanding / intuition - if I get a chance later, I will take a look at your specific question in particular. In any case, the way I think about any of the chunked functions is as follows:
- The `read_*_chunked` function itself takes care of "chopping the file up" into chunks, where each chunk is processed as the `read_*` part of `read_*_chunked` indicates (i.e. `read_delim_chunked` will pass each chunk to `read_delim` before it goes into the callback)
- Whatever callback I choose (which defines the callback output) will be called for each chunk with the `chunk` and `pos` parameters (where `pos` is the starting line for the chunk, and `chunk` is the value of the chunk from the previous bullet)
- Output that is returned from the function depends on the callback I have selected (e.g. `SideEffectChunkCallback` has no output)
Perhaps an example is more illustrative (I truncated some output for brevity):
library(readr)
write_csv(iris,'tmp_iris.csv')
# simple function to print values
f <- function(chunk, pos) {
  print(pos)
  print(chunk)
}
# here - each chunk is processed by `read_delim`
# there is no output (just calling for a side-effect each time)
read_delim_chunked(file='tmp_iris.csv'
, callback=SideEffectChunkCallback$new(f)
, delim=','
, chunk_size = 10
)
#> Parsed with column specification:
#> cols(
#> Sepal.Length = col_double(),
#> Sepal.Width = col_double(),
#> Petal.Length = col_double(),
#> Petal.Width = col_double(),
#> Species = col_character()
#> )
#> [1] 1
#> # A tibble: 10 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5.0 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> [1] 11
#> # A tibble: 10 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.4 3.7 1.5 0.2 setosa
#> 2 4.8 3.4 1.6 0.2 setosa
#> 3 4.8 3.0 1.4 0.1 setosa
#> 4 4.3 3.0 1.1 0.1 setosa
#> 5 5.8 4.0 1.2 0.2 setosa
#> 6 5.7 4.4 1.5 0.4 setosa
#> 7 5.4 3.9 1.3 0.4 setosa
#> 8 5.1 3.5 1.4 0.3 setosa
#> 9 5.7 3.8 1.7 0.3 setosa
#> 10 5.1 3.8 1.5 0.3 setosa
...
# here - each chunk is processed by `read_lines`
# there is no output (just calling for a side-effect each time)
read_lines_chunked(file='tmp_iris.csv'
, callback=SideEffectChunkCallback$new(f)
, chunk_size = 10)
#> [1] 1
#> [1] "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species"
#> [2] "5.1,3.5,1.4,0.2,setosa"
#> [3] "4.9,3,1.4,0.2,setosa"
#> [4] "4.7,3.2,1.3,0.2,setosa"
#> [5] "4.6,3.1,1.5,0.2,setosa"
#> [6] "5,3.6,1.4,0.2,setosa"
#> [7] "5.4,3.9,1.7,0.4,setosa"
#> [8] "4.6,3.4,1.4,0.3,setosa"
#> [9] "5,3.4,1.5,0.2,setosa"
#> [10] "4.4,2.9,1.4,0.2,setosa"
#> [1] 11
#> [1] "4.9,3.1,1.5,0.1,setosa" "5.4,3.7,1.5,0.2,setosa"
#> [3] "4.8,3.4,1.6,0.2,setosa" "4.8,3,1.4,0.1,setosa"
#> [5] "4.3,3,1.1,0.1,setosa" "5.8,4,1.2,0.2,setosa"
#> [7] "5.7,4.4,1.5,0.4,setosa" "5.4,3.9,1.3,0.4,setosa"
#> [9] "5.1,3.5,1.4,0.3,setosa" "5.7,3.8,1.7,0.3,setosa"
...
# simple function that returns chunk
return_chunk <- function(chunk, pos) {
  return(chunk)
}
# here - processed by `read_delim`
# output is a data.frame (aggregate all chunks together)
output <- read_delim_chunked(file='tmp_iris.csv'
, callback=DataFrameCallback$new(return_chunk)
, delim=','
, chunk_size=10)
#> Parsed with column specification:
#> cols(
#> Sepal.Length = col_double(),
#> Sepal.Width = col_double(),
#> Petal.Length = col_double(),
#> Petal.Width = col_double(),
#> Species = col_character()
#> )
print(output)
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5.0 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ... with 140 more rows
# here - each chunk is processed by `read_lines`
# output is row-bound into a character matrix (ugly)
output2 <- read_lines_chunked(file='tmp_iris.csv'
, callback=DataFrameCallback$new(return_chunk)
, chunk_size=10)
print(output2)
#> [,1]
#> [1,] "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species"
#> [2,] "4.9,3.1,1.5,0.1,setosa"
#> [3,] "5.1,3.8,1.5,0.3,setosa"
#> [4,] "4.7,3.2,1.6,0.2,setosa"
#> [5,] "5.1,3.4,1.5,0.2,setosa"
#> [6,] "5,3.3,1.4,0.2,setosa"
#> [7,] "5.2,2.7,3.9,1.4,versicolor"
#> [8,] "5.6,2.5,3.9,1.1,versicolor"
#> [9,] "5.7,2.6,3.5,1,versicolor"
#> [10,] "5.5,2.5,4,1.3,versicolor"
#> [11,] "5.7,2.8,4.1,1.3,versicolor"
#> [12,] "7.2,3.6,6.1,2.5,virginica"
#> [13,] "6,2.2,5,1.5,virginica"
#> [14,] "7.2,3,5.8,1.6,virginica"
#> [15,] "6.9,3.1,5.4,2.1,virginica"
#> [16,] "5.9,3,5.1,1.8,virginica"
#> [,2] [,3]
#> [1,] "5.1,3.5,1.4,0.2,setosa" "4.9,3,1.4,0.2,setosa"
#> [2,] "5.4,3.7,1.5,0.2,setosa" "4.8,3.4,1.6,0.2,setosa"
#> [3,] "5.4,3.4,1.7,0.2,setosa" "5.1,3.7,1.5,0.4,setosa"
#> [4,] "4.8,3.1,1.6,0.2,setosa" "5.4,3.4,1.5,0.4,setosa"
#> [5,] "5,3.5,1.3,0.3,setosa" "4.5,2.3,1.3,0.3,setosa"
#> [6,] "7,3.2,4.7,1.4,versicolor" "6.4,3.2,4.5,1.5,versicolor"
#> [7,] "5,2,3.5,1,versicolor" "5.9,3,4.2,1.5,versicolor"
#> [8,] "5.9,3.2,4.8,1.8,versicolor" "6.1,2.8,4,1.3,versicolor"
#> [9,] "5.5,2.4,3.8,1.1,versicolor" "5.5,2.4,3.7,1,versicolor"
#> [10,] "5.5,2.6,4.4,1.2,versicolor" "6.1,3,4.6,1.4,versicolor"
#> [11,] "6.3,3.3,6,2.5,virginica" "5.8,2.7,5.1,1.9,virginica"
#> [12,] "6.5,3.2,5.1,2,virginica" "6.4,2.7,5.3,1.9,virginica"
#> [13,] "6.9,3.2,5.7,2.3,virginica" "5.6,2.8,4.9,2,virginica"
#> [14,] "7.4,2.8,6.1,1.9,virginica" "7.9,3.8,6.4,2,virginica"
#> [15,] "6.7,3.1,5.6,2.4,virginica" "6.9,3.1,5.1,2.3,virginica"
#> [16,] "5.9,3,5.1,1.8,virginica" "5.9,3,5.1,1.8,virginica"
#> [,4] [,5]
#> [1,] "4.7,3.2,1.3,0.2,setosa" "4.6,3.1,1.5,0.2,setosa"
#> [2,] "4.8,3,1.4,0.1,setosa" "4.3,3,1.1,0.1,setosa"
#> [3,] "4.6,3.6,1,0.2,setosa" "5.1,3.3,1.7,0.5,setosa"
#> [4,] "5.2,4.1,1.5,0.1,setosa" "5.5,4.2,1.4,0.2,setosa"
#> [5,] "4.4,3.2,1.3,0.2,setosa" "5,3.5,1.6,0.6,setosa"
#> [6,] "6.9,3.1,4.9,1.5,versicolor" "5.5,2.3,4,1.3,versicolor"
#> [7,] "6,2.2,4,1,versicolor" "6.1,2.9,4.7,1.4,versicolor"
#> [8,] "6.3,2.5,4.9,1.5,versicolor" "6.1,2.8,4.7,1.2,versicolor"
#> [9,] "5.8,2.7,3.9,1.2,versicolor" "6,2.7,5.1,1.6,versicolor"
#> [10,] "5.8,2.6,4,1.2,versicolor" "5,2.3,3.3,1,versicolor"
#> [11,] "7.1,3,5.9,2.1,virginica" "6.3,2.9,5.6,1.8,virginica"
#> [12,] "6.8,3,5.5,2.1,virginica" "5.7,2.5,5,2,virginica"
#> [13,] "7.7,2.8,6.7,2,virginica" "6.3,2.7,4.9,1.8,virginica"
#> [14,] "6.4,2.8,5.6,2.2,virginica" "6.3,2.8,5.1,1.5,virginica"
#> [15,] "5.8,2.7,5.1,1.9,virginica" "6.8,3.2,5.9,2.3,virginica"
#> [16,] "5.9,3,5.1,1.8,virginica" "5.9,3,5.1,1.8,virginica"
...
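As an aside, if the row-bound matrix above is not what you want, readr also provides `ListCallback`, which keeps each chunk as its own element of a list instead - a minimal sketch:
# each chunk is kept as one list element (no rbind coercion)
output_list <- read_lines_chunked(file='tmp_iris.csv'
                                  , callback=ListCallback$new(return_chunk)
                                  , chunk_size=10)
length(output_list)  # one element per chunk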
All that is important for the callback function (`f` or `return_chunk` above) is that it knows how to handle a chunk of data, however it is being passed (i.e. by `read_delim` or `read_lines`). I may be misunderstanding how you are doing things, but `chunk_id` is probably easier to determine by using the `pos` variable. `pos + nrow(chunk) - 1` (or `pos + length(chunk) - 1`, depending on the function you are using) also gives you the end row of the chunk. It is easy enough to determine an "iteration number" from those values if you fix `chunk_size` (see the sketch below).
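For example, here is a minimal sketch of deriving a `chunk_id` inside the callback (the fixed `chunk_size` of 10 and the helper name `add_chunk_id` are just assumptions for illustration):
library(readr)
library(dplyr)

# derive an iteration number from `pos`, assuming a fixed chunk_size of 10:
# pos = 1, 11, 21, ... maps to chunk_id = 1, 2, 3, ...
add_chunk_id <- function(chunk, pos) {
  mutate(chunk, chunk_id = (pos - 1) %/% 10 + 1)
}

output_id <- read_delim_chunked(file='tmp_iris.csv'
                                , callback=DataFrameCallback$new(add_chunk_id)
                                , delim=','
                                , chunk_size=10)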
Now you do have to be careful, because if you are not subsetting with `filter`, the `DataFrameCallback` will just stream the whole file into memory (that is what I did above). A classic example is to use `SideEffectChunkCallback` and insert the rows into a database - if I were doing so in these examples, I would never have held more than 10 rows in memory at a given time (see the sketch at the end). The other thing I have done before is to store the line numbers that were interesting to me with the `DataFrameCallback` and then go back over the file in a subsequent pass to extract the interesting rows (I could not be certain that the "chunk" of data I wanted would live within one of the "readr chunks"... large, MULTI-line JSON objects!). Lots of interesting approaches to explore!
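For the database pattern, a minimal sketch of what I mean (the SQLite connection and the table name 'iris' are assumptions - adapt to your own database):
library(readr)
library(DBI)

# connect to a local SQLite database (illustrative only)
con <- dbConnect(RSQLite::SQLite(), 'tmp_iris.sqlite')

# append each chunk to the table - at most chunk_size rows held in memory
insert_chunk <- function(chunk, pos) {
  dbWriteTable(con, 'iris', chunk, append = TRUE)
}

read_delim_chunked(file='tmp_iris.csv'
                   , callback=SideEffectChunkCallback$new(insert_chunk)
                   , delim=','
                   , chunk_size=10)

dbDisconnect(con)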