That error message looks like it is coming from the definition of chunk_id in your mutate() statement. I will try to give an overview of my understanding/intuition here; if I get a chance later, I will take a look at your specific question. In any case, the way I think about the chunked functions is as follows:
- The read_*_chunked function itself takes care of "chopping the file up" into chunks, where each chunk is parsed as the read_* part of read_*_chunked indicates (i.e. read_delim_chunked passes each chunk through read_delim before it reaches the callback).
- Whatever callback I choose (which determines the form of the output) is called once per chunk with the chunk and pos parameters, where pos is the starting line of the chunk and chunk is the parsed value from the previous bullet.
- The output returned from the function depends on the callback I have selected (e.g. SideEffectChunkCallback produces no output).
Perhaps an example is more illustrative (I have truncated some of the output for brevity):
library(readr)
write_csv(iris, 'tmp_iris.csv')
# simple function to print values
f <- function(chunk, pos) {
  print(pos)
  print(chunk)
}
# here - each chunk is processed by `read_delim`
# there is no output (just calling for a side-effect each time)
read_delim_chunked(file='tmp_iris.csv'
, callback=SideEffectChunkCallback$new(f)
, delim=','
, chunk_size = 10
)
#> Parsed with column specification:
#> cols(
#> Sepal.Length = col_double(),
#> Sepal.Width = col_double(),
#> Petal.Length = col_double(),
#> Petal.Width = col_double(),
#> Species = col_character()
#> )
#> [1] 1
#> # A tibble: 10 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5.0 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> [1] 11
#> # A tibble: 10 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.4 3.7 1.5 0.2 setosa
#> 2 4.8 3.4 1.6 0.2 setosa
#> 3 4.8 3.0 1.4 0.1 setosa
#> 4 4.3 3.0 1.1 0.1 setosa
#> 5 5.8 4.0 1.2 0.2 setosa
#> 6 5.7 4.4 1.5 0.4 setosa
#> 7 5.4 3.9 1.3 0.4 setosa
#> 8 5.1 3.5 1.4 0.3 setosa
#> 9 5.7 3.8 1.7 0.3 setosa
#> 10 5.1 3.8 1.5 0.3 setosa
...
# here - each chunk is processed by `read_lines`
# there is no output (just calling for a side-effect each time)
read_lines_chunked(file='tmp_iris.csv'
, callback=SideEffectChunkCallback$new(f)
, chunk_size = 10)
#> [1] 1
#> [1] "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species"
#> [2] "5.1,3.5,1.4,0.2,setosa"
#> [3] "4.9,3,1.4,0.2,setosa"
#> [4] "4.7,3.2,1.3,0.2,setosa"
#> [5] "4.6,3.1,1.5,0.2,setosa"
#> [6] "5,3.6,1.4,0.2,setosa"
#> [7] "5.4,3.9,1.7,0.4,setosa"
#> [8] "4.6,3.4,1.4,0.3,setosa"
#> [9] "5,3.4,1.5,0.2,setosa"
#> [10] "4.4,2.9,1.4,0.2,setosa"
#> [1] 11
#> [1] "4.9,3.1,1.5,0.1,setosa" "5.4,3.7,1.5,0.2,setosa"
#> [3] "4.8,3.4,1.6,0.2,setosa" "4.8,3,1.4,0.1,setosa"
#> [5] "4.3,3,1.1,0.1,setosa" "5.8,4,1.2,0.2,setosa"
#> [7] "5.7,4.4,1.5,0.4,setosa" "5.4,3.9,1.3,0.4,setosa"
#> [9] "5.1,3.5,1.4,0.3,setosa" "5.7,3.8,1.7,0.3,setosa"
...
# simple function that returns chunk
return_chunk <- function(chunk, pos) {
  return(chunk)
}
# here - processed by `read_delim`
# output is a data.frame (aggregate all chunks together)
output <- read_delim_chunked(file='tmp_iris.csv'
, callback=DataFrameCallback$new(return_chunk)
, delim=','
, chunk_size=10)
#> Parsed with column specification:
#> cols(
#> Sepal.Length = col_double(),
#> Sepal.Width = col_double(),
#> Petal.Length = col_double(),
#> Petal.Width = col_double(),
#> Species = col_character()
#> )
print(output)
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5.0 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ... with 140 more rows
# here - processed by `read_lines`
# output is rbind-ed into a character matrix (ugly)
output2 <- read_lines_chunked(file='tmp_iris.csv'
, callback=DataFrameCallback$new(return_chunk)
, chunk_size=10)
print(output2)
#> [,1]
#> [1,] "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species"
#> [2,] "4.9,3.1,1.5,0.1,setosa"
#> [3,] "5.1,3.8,1.5,0.3,setosa"
#> [4,] "4.7,3.2,1.6,0.2,setosa"
#> [5,] "5.1,3.4,1.5,0.2,setosa"
#> [6,] "5,3.3,1.4,0.2,setosa"
#> [7,] "5.2,2.7,3.9,1.4,versicolor"
#> [8,] "5.6,2.5,3.9,1.1,versicolor"
#> [9,] "5.7,2.6,3.5,1,versicolor"
#> [10,] "5.5,2.5,4,1.3,versicolor"
#> [11,] "5.7,2.8,4.1,1.3,versicolor"
#> [12,] "7.2,3.6,6.1,2.5,virginica"
#> [13,] "6,2.2,5,1.5,virginica"
#> [14,] "7.2,3,5.8,1.6,virginica"
#> [15,] "6.9,3.1,5.4,2.1,virginica"
#> [16,] "5.9,3,5.1,1.8,virginica"
#> [,2] [,3]
#> [1,] "5.1,3.5,1.4,0.2,setosa" "4.9,3,1.4,0.2,setosa"
#> [2,] "5.4,3.7,1.5,0.2,setosa" "4.8,3.4,1.6,0.2,setosa"
#> [3,] "5.4,3.4,1.7,0.2,setosa" "5.1,3.7,1.5,0.4,setosa"
#> [4,] "4.8,3.1,1.6,0.2,setosa" "5.4,3.4,1.5,0.4,setosa"
#> [5,] "5,3.5,1.3,0.3,setosa" "4.5,2.3,1.3,0.3,setosa"
#> [6,] "7,3.2,4.7,1.4,versicolor" "6.4,3.2,4.5,1.5,versicolor"
#> [7,] "5,2,3.5,1,versicolor" "5.9,3,4.2,1.5,versicolor"
#> [8,] "5.9,3.2,4.8,1.8,versicolor" "6.1,2.8,4,1.3,versicolor"
#> [9,] "5.5,2.4,3.8,1.1,versicolor" "5.5,2.4,3.7,1,versicolor"
#> [10,] "5.5,2.6,4.4,1.2,versicolor" "6.1,3,4.6,1.4,versicolor"
#> [11,] "6.3,3.3,6,2.5,virginica" "5.8,2.7,5.1,1.9,virginica"
#> [12,] "6.5,3.2,5.1,2,virginica" "6.4,2.7,5.3,1.9,virginica"
#> [13,] "6.9,3.2,5.7,2.3,virginica" "5.6,2.8,4.9,2,virginica"
#> [14,] "7.4,2.8,6.1,1.9,virginica" "7.9,3.8,6.4,2,virginica"
#> [15,] "6.7,3.1,5.6,2.4,virginica" "6.9,3.1,5.1,2.3,virginica"
#> [16,] "5.9,3,5.1,1.8,virginica" "5.9,3,5.1,1.8,virginica"
#> [,4] [,5]
#> [1,] "4.7,3.2,1.3,0.2,setosa" "4.6,3.1,1.5,0.2,setosa"
#> [2,] "4.8,3,1.4,0.1,setosa" "4.3,3,1.1,0.1,setosa"
#> [3,] "4.6,3.6,1,0.2,setosa" "5.1,3.3,1.7,0.5,setosa"
#> [4,] "5.2,4.1,1.5,0.1,setosa" "5.5,4.2,1.4,0.2,setosa"
#> [5,] "4.4,3.2,1.3,0.2,setosa" "5,3.5,1.6,0.6,setosa"
#> [6,] "6.9,3.1,4.9,1.5,versicolor" "5.5,2.3,4,1.3,versicolor"
#> [7,] "6,2.2,4,1,versicolor" "6.1,2.9,4.7,1.4,versicolor"
#> [8,] "6.3,2.5,4.9,1.5,versicolor" "6.1,2.8,4.7,1.2,versicolor"
#> [9,] "5.8,2.7,3.9,1.2,versicolor" "6,2.7,5.1,1.6,versicolor"
#> [10,] "5.8,2.6,4,1.2,versicolor" "5,2.3,3.3,1,versicolor"
#> [11,] "7.1,3,5.9,2.1,virginica" "6.3,2.9,5.6,1.8,virginica"
#> [12,] "6.8,3,5.5,2.1,virginica" "5.7,2.5,5,2,virginica"
#> [13,] "7.7,2.8,6.7,2,virginica" "6.3,2.7,4.9,1.8,virginica"
#> [14,] "6.4,2.8,5.6,2.2,virginica" "6.3,2.8,5.1,1.5,virginica"
#> [15,] "5.8,2.7,5.1,1.9,virginica" "6.8,3.2,5.9,2.3,virginica"
#> [16,] "5.9,3,5.1,1.8,virginica" "5.9,3,5.1,1.8,virginica"
...
All that matters for the callback function (f or return_chunk above) is that it knows how to handle a chunk of data, however it is passed (i.e. by read_delim or read_lines). I may be misunderstanding how you are doing things, but chunk_id is probably easier to determine from the pos variable. Since pos is the starting row of the chunk, pos + nrow(chunk) - 1 (or pos + length(chunk) - 1 for the line-based functions) gives you the end row of the chunk as well. It is easy enough to derive an "iteration number" from those values if you fix chunk_size.
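For instance, a minimal sketch (with_chunk_id and output3 are names I made up for illustration; the arithmetic assumes pos is 1-based and chunk_size is fixed up front):
chunk_size <- 10
# each chunk starts at `pos`, so its "iteration number" is the
# count of full chunks that precede it, plus one
with_chunk_id <- function(chunk, pos) {
  chunk$chunk_id <- (pos - 1) %/% chunk_size + 1
  chunk
}
output3 <- read_delim_chunked(file='tmp_iris.csv'
                              , callback=DataFrameCallback$new(with_chunk_id)
                              , delim=','
                              , chunk_size=chunk_size)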
Now, you do have to be careful: if you are not subsetting (e.g. with filter), the DataFrameCallback will just stream the whole file into memory (that is what I did above). A classic example is to use SideEffectChunkCallback and insert the rows into a database - had I done that in these examples, I would never have held more than 10 rows in memory at a given time. Another thing I have done before is store the line numbers that were interesting to me with the DataFrameCallback and then make a second pass over the file to extract the interesting rows (I could not be certain that the "chunk" of data I wanted would live within one of the "readr chunks"... large, MULTI-line JSON objects!). Lots of interesting approaches to explore!
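To make the database pattern concrete, here is a minimal sketch assuming the DBI and RSQLite packages (the connection, the 'iris' table name, and the insert_chunk helper are all mine, not part of readr):
library(DBI)
con <- dbConnect(RSQLite::SQLite(), 'tmp_iris.sqlite')
# each chunk is appended to the table and then discarded, so no
# more than chunk_size rows are held in memory at any one time
insert_chunk <- function(chunk, pos) {
  dbWriteTable(con, 'iris', chunk, append=TRUE)
}
read_delim_chunked(file='tmp_iris.csv'
                   , callback=SideEffectChunkCallback$new(insert_chunk)
                   , delim=','
                   , chunk_size=10)
dbDisconnect(con)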