I am running into an issue reading several large .gz.csv files with read_csv() and quickly hitting a "no space left on device" error because the files are, I suppose, being temporarily decompressed into my /tmp folder.
Is there a way to change that behavior? For instance, could the files be decompressed in the same directory as the .gz.csv files? Does that make sense?
EDIT: I now realize this is indeed a matter of redirecting the temporary files to a folder other than the default. Is there a way to do so in RStudio?
The tempfile used is always created in the R session's temporary directory. You can use any of the shell environment variables TMPDIR, TMP and TEMP to control where the temporary directory is located, but this only has an effect when the R session first starts. You can't change the temporary directory within a running session.
You can use usethis::edit_r_environ() to open the user-level .Renviron file, where you can define one of those variables, e.g.
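A minimal sketch of what that file could contain (the path is just a placeholder for a directory on a volume with enough free space):

```
# .Renviron -- /data/r-tmp is a placeholder; point it at any directory with room
TMPDIR=/data/r-tmp
```

After restarting R, tempdir() should point somewhere under that directory, and the decompressed files will land there instead of /tmp.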
I think there is an interesting follow-up here. @jimhester, is there a way to clean the temp directory during the session?
Here is the idea: doing something like list.files('mydir', full.names = TRUE) %>% map(~ read_csv(.x))
causes problems because, when reading hundreds of different .gz files, the temporary files accumulate until they hit the disk limit.
I know I can call gc() during the session to free memory from unused objects, but is there an equivalent (callable in a loop) that cleans up the temporary directory as well?
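If there is no built-in equivalent, one workaround would be to delete the leftover temp files yourself after each read. This is just a sketch using base R file handling, not a readr/vroom API, and it assumes readr >= 2.0 (for the lazy argument) so nothing is still memory-mapping the temp copy when it gets removed:

```r
library(magrittr)  # for %>%
library(purrr)
library(readr)

read_and_clean <- function(path) {
  # force a full, non-lazy read so the decompressed temp copy is no longer needed
  df <- read_csv(path, lazy = FALSE)
  # then drop whatever was left behind in the session's temp directory;
  # adjust the pattern if other code also writes files there
  leftovers <- list.files(tempdir(), full.names = TRUE, pattern = "\\.csv$")
  unlink(leftovers)
  df
}

dfs <- list.files("mydir", full.names = TRUE) %>% map(read_and_clean)
```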
Thanks!! I'll check it out. Something I don't quite understand is whether vroom is really fast, or whether it only appears fast because it defers loading the data until you need it. In other words, once the computation actually happens, is it no better than readr and the others? Am I missing something? (I surely am.)
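One way to see the difference for yourself (just a sketch; it assumes vroom >= 1.1 for the altrep argument, and uses a small throwaway file so it runs anywhere) is to time the read both lazily and eagerly, and to notice when the parsing cost is actually paid:

```r
library(vroom)

# small placeholder file just to make the sketch self-contained;
# in practice substitute one of your large .gz.csv files
path <- tempfile(fileext = ".csv")
vroom_write(mtcars, path, delim = ",")

# lazy read (the default): returns quickly, columns are indexed but not parsed yet
system.time(df_lazy <- vroom(path))

# touching the values is when the parsing cost is actually paid
system.time(lapply(df_lazy, sum))  # crude way to force every column to materialize

# eager read with ALTREP disabled: all parsing happens up front,
# so this timing is the one comparable to readr and friends
system.time(df_eager <- vroom(path, altrep = FALSE))
```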