I have 5.3 GB of data: 35,360 gzip-compressed files, each containing one CSV log file, organized into 41 folders. The file names follow this pattern:
Folder 2018-10-25:
2018-10-25-00-00-0e41.csv.gz;
2018-10-25-00-00-7f7d.csv.gz;
2018-10-25-00-00-32fa.csv.gz and so on;
Folder 2018-10-26:
2018-10-26-00-00-0e41.csv.gz;
2018-10-26-00-00-7f7d.csv.gz;
2018-10-26-00-00-32fa.csv.gz and so on.
The last folder is 2018-12-04.
How can I read all those files into R as a single data set? Any tips for working with this much data?
5.3GB is a lot of data; you might have problems fitting it into memory. Do you have access to a database? Databases play nicely with dplyr (via DBI / dbplyr) and can do a lot of heavy lifting for you.
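Once the data is in a database, querying it from R is straightforward. Here is a rough sketch, assuming an SQLite backend and a table called logs with a date column (both names are just placeholders):

library(DBI)
library(dplyr)

# connect to an on-disk SQLite database; any DBI backend works the same way
con <- dbConnect(RSQLite::SQLite(), "logs.sqlite")

# tbl() gives a lazy reference to the table; dplyr verbs are translated
# to SQL by dbplyr and only the final result is pulled into memory
logs <- tbl(con, "logs")

daily_counts <- logs %>%
  count(date) %>%
  collect()

dbDisconnect(con)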
My approach would be along these lines:
gzfiles <- list.files(pattern = '\\.csv\\.gz$', recursive = TRUE) # character vector of all gzipped csv paths under the date folders
for (i in seq_along(gzfiles)) {
  # read.csv() can read a gzip-compressed file directly, so no separate unzip step is needed
  csv_chunk <- read.csv(gzfiles[i])
  # somehow insert the content of the csv file into your database (see the sketch below)
  # this will depend on its structure and your database of choice
}
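For the insert step itself, something along these lines might work, again assuming RSQLite and a hypothetical table name logs (other DBI backends look almost identical):

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "logs.sqlite")

# gzfiles is the vector of file paths from the loop above
for (i in seq_along(gzfiles)) {
  chunk <- read.csv(gzfiles[i])
  # append each file's rows to one table, so you end up with a single data set
  dbWriteTable(con, "logs", chunk, append = TRUE)
}

dbDisconnect(con)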
Also, if you are willing to risk the Purity of Essence of your code, consider this script.
It is written in the language of the snake people and integrates easily with R code. I have used it with great success for parsing S3 logs; I am certain it can be adapted to other log structures with only minor hacking.