boshek
April 13, 2021, 5:41pm
1
I am trying to use `read_lines_chunked` to read in a huge fixed-width file in chunks. In my actual example I am doing something to the `d` object (hence `SideEffectChunkCallback`), but here for this reprex I am simply reading it in. I don't want it to return anything. And yet, for whatever reason, R is holding on to a bunch of memory. My assumption was that the advantage of reading things in chunks is that it doesn't hold them in memory. Am I misunderstanding what's happening here? Am I misunderstanding how to use `read_lines_chunked`? Why is R hanging on to that 2 MB? I know this is small, but my sense is that it should be much closer to zero.
TIA
Sam
library(readr)
library(lobstr)
#> Warning: package 'lobstr' was built under R version 4.0.5
suppressPackageStartupMessages(library(gdata, warn.conflicts = FALSE))
#> Warning in system(cmd, intern = intern, wait = wait | intern,
#> show.output.on.console = wait, : running command 'C:\WINDOWS\system32\cmd.exe /c
#> ftype perl' had status 2
#> Warning in system(cmd, intern = intern, wait = wait | intern,
#> show.output.on.console = wait, : running command 'C:\WINDOWS\system32\cmd.exe /c
#> ftype perl' had status 2
## flexible fn to make fixed width
make_fwf <- function(nrows, file) {
  dat <- data.frame(
    x = runif(nrows),
    y = runif(nrows)
  )
  gdata::write.fwf(dat, file, colnames = FALSE)
  rm(dat)
  gc()
  file
}
fwf_sample <- make_fwf(1E6, "fwf-eg.fwf")
(start <- mem_used())
#> 60,273,208 B
f <- function(x, pos) {
  d <- read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("x", "y")), col_types = c("dd"))
  rm(d)
  gc()
}
read_lines_chunked(
  file = fwf_sample,
  callback = SideEffectChunkCallback$new(f),
  chunk_size = 50000,
  progress = FALSE
)
#> NULL
## Memory taken up
mem_used()
#> 62,609,880 B
## Memory added
mem_used() - start
#> 2,337,608 B
## Size of file
file.info(fwf_sample)$size
#> [1] 2.7e+07
Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.0.4 (2021-02-15)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_Canada.1252
#> ctype English_Canada.1252
#> tz America/Los_Angeles
#> date 2021-04-13
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.3)
#> cli 2.4.0 2021-04-05 [1] CRAN (R 4.0.4)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3)
#> debugme 1.1.0 2017-10-22 [1] CRAN (R 4.0.2)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
#> fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.3)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> gdata * 2.18.0 2017-06-06 [1] CRAN (R 4.0.4)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> gtools 3.8.2 2020-03-31 [1] CRAN (R 4.0.3)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.0)
#> hms 1.0.0 2021-01-13 [1] CRAN (R 4.0.3)
#> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.3)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4)
#> lobstr * 1.1.1 2019-07-02 [1] CRAN (R 4.0.5)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> pillar 1.5.1 2021-03-05 [1] CRAN (R 4.0.4)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
#> R.cache 0.14.0 2019-12-06 [1] CRAN (R 4.0.0)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.0.2)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.0.2)
#> R.utils 2.10.1 2020-08-26 [1] CRAN (R 4.0.2)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
#> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.3)
#> readr * 1.4.0 2020-10-05 [1] CRAN (R 4.0.2)
#> rematch2 2.1.2 2020-05-01 [1] CRAN (R 4.0.0)
#> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.0.5)
#> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3)
#> rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.4)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> styler 1.4.1 2021-03-30 [1] CRAN (R 4.0.4)
#> tibble 3.1.0 2021-02-25 [1] CRAN (R 4.0.4)
#> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.5)
#> vctrs 0.3.7 2021-03-29 [1] CRAN (R 4.0.5)
#> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3)
#> xfun 0.22 2021-03-11 [1] CRAN (R 4.0.4)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] C:/Users/salbers/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.4/library
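One thing worth noting about the reprex above: the callback `f` never touches its `x` argument. It calls `read_fwf()` on the full file path each time, so every one of the chunk callbacks re-reads the whole file. Below is a minimal sketch of a callback that works on the chunk itself. The file layout (two 4-character columns separated by a space), the temporary file, and the names `process_chunk`, `rows_seen` and `total` are assumptions made up for illustration; they are not from the original post.

```r
library(readr)

# Tiny stand-in for the real file (hypothetical layout):
# two 4-character columns separated by a single space.
path <- tempfile(fileext = ".fwf")
writeLines(sprintf("%4d %4d", 1:10, 11:20), path)

rows_seen <- 0
total <- 0

# `x` holds the lines of the current chunk as a character vector;
# `pos` is the position of the chunk within the file.
process_chunk <- function(x, pos) {
  first  <- as.numeric(substr(x, 1, 4))  # characters 1-4
  second <- as.numeric(substr(x, 6, 9))  # characters 6-9
  # Side effects only: update running totals, keep nothing else around.
  rows_seen <<- rows_seen + length(x)
  total <<- total + sum(first) + sum(second)
  invisible(NULL)
}

read_lines_chunked(
  file = path,
  callback = SideEffectChunkCallback$new(process_chunk),
  chunk_size = 4,
  progress = FALSE
)

rows_seen
total
```

With this shape, only the lines of one chunk plus the running totals are live at any time, which is the behaviour the question was expecting from chunked reading.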
This is mainly just a "cost of doing business". As for the last 2 MB that the routine just won't let go of: when I ran your code and added `mem_used() - start` at the end, the number seems to go down pretty quickly (see the two outputs immediately below). The code after that shows the memory involved simply from having the libraries loaded, without doing any read or write operations.
> mem_used() - start
2,512 B
> mem_used() - start
694,384 B
library(lobstr)
gc()
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 630527 33.7 1450414 77.5 881415 47.1
#> Vcells 1156304 8.9 8388608 64.0 1847276 14.1
mem_used()
#> 44,597,664 B
library(ggplot2)
mem_used()
#> 56,156,888 B
suppressPackageStartupMessages({
  library(gdata)
})
mem_used()
#> 56,662,288 B
sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Pop!_OS 20.10
#>
#> Matrix products: default
#> BLAS: /usr/local/lib/R/lib/libRblas.so
#> LAPACK: /usr/local/lib/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] gdata_2.18.0 ggplot2_3.3.3 lobstr_1.1.1
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.6 pillar_1.6.0 compiler_4.0.4 highr_0.8
#> [5] tools_4.0.4 digest_0.6.27 evaluate_0.14 lifecycle_1.0.0
#> [9] tibble_3.1.0 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.10
#> [13] reprex_2.0.0 DBI_1.1.1 yaml_2.2.1 xfun_0.22
#> [17] withr_2.4.1 styler_1.4.1 stringr_1.4.0 dplyr_1.0.5
#> [21] knitr_1.31 gtools_3.8.2 generics_0.1.0 fs_1.5.0
#> [25] vctrs_0.3.7 grid_4.0.4 tidyselect_1.1.0 glue_1.4.2
#> [29] R6_2.5.0 fansi_0.4.2 rmarkdown_2.7 purrr_0.3.4
#> [33] magrittr_2.0.1 backports_1.2.1 scales_1.1.1 ellipsis_0.3.1
#> [37] htmltools_0.5.1.1 assertthat_0.2.1 colorspace_2.0-0 utf8_1.2.1
#> [41] stringi_1.5.3 munsell_0.5.0 crayon_1.4.1
boshek
April 14, 2021, 4:36am
3
Right, I see what you are saying. My problem is that when using `read_lines_chunked` on a huge 3 GB flat file, I am able to process the whole file, but at the end I am left with a huge RAM load that doesn't correspond to any objects in the R environment. I think that maybe my reprex isn't capturing my situation correctly. I still think there is a memory leak somewhere: working through the 3 GB file results in a ~3 GB memory load. That seems counter to the intent of `read_lines_chunked`. I'll try to work on a better reprex.
Yeah, seems odd. One thing I've heard, from Hadley, is that R generally does a good job at garbage collection, but that the OS doesn't always cooperate in taking the freed memory back.
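To separate the two effects, it can help to look at what R itself is tracking rather than at the process size shown by the OS. The following is a rough sketch using only base R; the helper name `r_mem_mb` and the size of the test vector are arbitrary choices for illustration. It reports R's own accounting from `gc()`, so it says nothing about when (or whether) the OS reclaims the freed pages.

```r
# Memory R is currently tracking, in Mb: the "used (Mb)" column of gc(),
# summed over Ncells and Vcells. Calling gc() also triggers a collection.
r_mem_mb <- function() sum(gc()[, 2])

before <- r_mem_mb()

big <- numeric(5e6)   # roughly 40 MB of doubles
during <- r_mem_mb()

rm(big)
after <- r_mem_mb()   # the gc() inside r_mem_mb() collects `big`

c(before = before, during = during, after = after)
```

If `after` drops back close to `before` while Task Manager or `top` still shows a large footprint, the memory has been released by R and the remainder is down to the allocator or the OS. If `after` stays high with no large objects in the workspace, that points more towards memory held outside R's view, for example by compiled code.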
mara
April 14, 2021, 12:51pm
5
Did you see the update to dev readr yesterday that avoids a memory leak when reading chunks? Might be relevant.
committed 08:04PM - 13 Apr 21 UTC
boshek
April 14, 2021, 3:58pm
6
Thanks @mara! Do you have any idea of the release cycle of readr? As in, do you know when this bug fix will hit CRAN?
@technocrat, it was the compression that my earlier reprex was missing: the memory growth shows up when the file is gzipped. Have a look here:
library(readr)
library(lobstr)
#> Warning: package 'lobstr' was built under R version 4.0.5
suppressPackageStartupMessages(library(gdata, warn.conflicts = FALSE))
#> Warning in system(cmd, intern = intern, wait = wait | intern,
#> show.output.on.console = wait, : running command 'C:\WINDOWS\system32\cmd.exe /c
#> ftype perl' had status 2
#> Warning in system(cmd, intern = intern, wait = wait | intern,
#> show.output.on.console = wait, : running command 'C:\WINDOWS\system32\cmd.exe /c
#> ftype perl' had status 2
mem_used()
#> 51,673,568 B
# flexible fn to make fixed width
make_fwf <- function(nrows, file) {
  dat <- data.frame(
    x = runif(nrows),
    y = runif(nrows)
  )
  gdata::write.fwf(dat, file, colnames = FALSE)
  rm(dat)
  gc()
  R.utils::gzip(file)
}
fwf_sample <- make_fwf(1E6, "fwf-eg.fwf")
(start <- mem_used())
#> 61,211,352 B
f <- function(x, pos) {
  d <- read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("x", "y")), col_types = c("dd"))
  rm(d)
  gc()
}
read_lines_chunked(
  file = fwf_sample,
  callback = SideEffectChunkCallback$new(f),
  chunk_size = 50000,
  progress = FALSE
)
#> NULL
## Memory taken up
mem_used()
#> 1,713,272,496 B
## Memory added
mem_used() - start
#> 1,652,060,232 B
## Size of file
file.info(fwf_sample)$size
#> [1] 9579982
Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.0.4 (2021-02-15)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_Canada.1252
#> ctype English_Canada.1252
#> tz America/Los_Angeles
#> date 2021-04-14
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.3)
#> cli 2.4.0 2021-04-05 [1] CRAN (R 4.0.4)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3)
#> debugme 1.1.0 2017-10-22 [1] CRAN (R 4.0.2)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
#> fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.3)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> gdata * 2.18.0 2017-06-06 [1] CRAN (R 4.0.4)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> gtools 3.8.2 2020-03-31 [1] CRAN (R 4.0.3)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.0)
#> hms 1.0.0 2021-01-13 [1] CRAN (R 4.0.3)
#> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.3)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4)
#> lobstr * 1.1.1 2019-07-02 [1] CRAN (R 4.0.5)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> pillar 1.5.1 2021-03-05 [1] CRAN (R 4.0.4)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
#> R.cache 0.14.0 2019-12-06 [1] CRAN (R 4.0.0)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.0.2)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.0.2)
#> R.utils 2.10.1 2020-08-26 [1] CRAN (R 4.0.2)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
#> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.3)
#> readr * 1.4.0 2020-10-05 [1] CRAN (R 4.0.2)
#> rematch2 2.1.2 2020-05-01 [1] CRAN (R 4.0.0)
#> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.0.5)
#> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3)
#> rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.4)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> styler 1.4.1 2021-03-30 [1] CRAN (R 4.0.4)
#> tibble 3.1.0 2021-02-25 [1] CRAN (R 4.0.4)
#> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.5)
#> vctrs 0.3.7 2021-03-29 [1] CRAN (R 4.0.5)
#> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3)
#> xfun 0.22 2021-03-11 [1] CRAN (R 4.0.4)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] C:/Users/salbers/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.4/library
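Since the growth in the reprex above only appears once the file is gzipped, one possible stopgap is to bypass readr's chunked reader and stream the compressed file through a base R `gzfile()` connection. This is an assumption on my part, not something tested in this thread, and it swaps readr for plain `readLines()`; whether it is fast enough for a 3 GB file is untested. The file contents and chunk size below are made up for illustration.

```r
# Write a small gzipped stand-in file (hypothetical contents).
path <- tempfile(fileext = ".fwf.gz")
out <- gzfile(path, "wt")
writeLines(sprintf("%4d %4d", 1:10, 11:20), out)
close(out)

# Stream it back a few lines at a time through an open connection.
con <- gzfile(path, "rt")
rows_seen <- 0
repeat {
  lines <- readLines(con, n = 4)      # next chunk of up to 4 lines
  if (length(lines) == 0) break       # end of file reached
  # ... process `lines` here, keeping only what is needed ...
  rows_seen <- rows_seen + length(lines)
}
close(con)

rows_seen
```

Because the connection stays open between calls, each `readLines()` picks up where the previous one stopped, so only one chunk of lines is held at a time.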
mara
April 14, 2021, 4:11pm
7
I don't. I think your best bet is to ask Jim on the GH issue.
boshek
April 15, 2021, 6:18pm
8
And just for completeness' sake, here is the reprex with the fixed version of readr:
library(readr)
library(lobstr)
#> Warning: package 'lobstr' was built under R version 4.0.5
suppressPackageStartupMessages(library(gdata, warn.conflicts = FALSE))
#> Warning in system(cmd, intern = intern, wait = wait | intern,
#> show.output.on.console = wait, : running command 'C:\WINDOWS\system32\cmd.exe /c
#> ftype perl' had status 2
#> Warning in system(cmd, intern = intern, wait = wait | intern,
#> show.output.on.console = wait, : running command 'C:\WINDOWS\system32\cmd.exe /c
#> ftype perl' had status 2
mem_used()
#> 54,031,456 B
# flexible fn to make fixed width
make_fwf <- function(nrows, file) {
  dat <- data.frame(
    x = runif(nrows),
    y = runif(nrows)
  )
  gdata::write.fwf(dat, file, colnames = FALSE)
  rm(dat)
  gc()
  R.utils::gzip(file)
}
fwf_sample <- make_fwf(1E6, "fwf-eg.fwf")
(start <- mem_used())
#> 63,573,320 B
f <- function(x, pos) {
  d <- read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("x", "y")), col_types = c("dd"))
  rm(d)
  gc()
}
read_lines_chunked(
  file = fwf_sample,
  callback = SideEffectChunkCallback$new(f),
  chunk_size = 50000,
  progress = FALSE
)
#> NULL
## Memory taken up
mem_used()
#> 65,777,192 B
## Memory added
mem_used() - start
#> 2,202,944 B
## Size of file
file.info(fwf_sample)$size
#> [1] 9580166
Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.0.4 (2021-02-15)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_Canada.1252
#> ctype English_Canada.1252
#> tz America/Los_Angeles
#> date 2021-04-15
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.3)
#> cli 2.4.0 2021-04-05 [1] CRAN (R 4.0.4)
#> clock 0.2.0 2021-04-12 [1] CRAN (R 4.0.5)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3)
#> debugme 1.1.0 2017-10-22 [1] CRAN (R 4.0.2)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
#> fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.3)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> gdata * 2.18.0 2017-06-06 [1] CRAN (R 4.0.4)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> gtools 3.8.2 2020-03-31 [1] CRAN (R 4.0.3)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.0)
#> hms 1.0.0 2021-01-13 [1] CRAN (R 4.0.3)
#> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.3)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4)
#> lobstr * 1.1.1 2019-07-02 [1] CRAN (R 4.0.5)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> pillar 1.5.1 2021-03-05 [1] CRAN (R 4.0.4)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
#> R.cache 0.14.0 2019-12-06 [1] CRAN (R 4.0.0)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.0.2)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.0.2)
#> R.utils 2.10.1 2020-08-26 [1] CRAN (R 4.0.2)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
#> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.3)
#> readr * 1.4.0.9000 2021-04-15 [1] Github (tidyverse/readr@68c2406)
#> rematch2 2.1.2 2020-05-01 [1] CRAN (R 4.0.0)
#> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.0.5)
#> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3)
#> rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.4)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> styler 1.4.1 2021-03-30 [1] CRAN (R 4.0.4)
#> tibble 3.1.0 2021-02-25 [1] CRAN (R 4.0.4)
#> tzdb 0.1.0 2021-03-04 [1] CRAN (R 4.0.5)
#> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.5)
#> vctrs 0.3.7 2021-03-29 [1] CRAN (R 4.0.5)
#> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3)
#> xfun 0.22 2021-03-11 [1] CRAN (R 4.0.4)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] C:/Users/salbers/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.4/library