I have a file that is too large to read into R all at once, so I've been using the read_csv_chunked function to read it in chunks. I had been working in the desktop version of RStudio, but even then the job was still running after 3–4 days, so my advisor set me up with a Google Cloud compute instance to try to get it done without tying up my laptop. The only problem is that the file (a CSV) is on my computer, and it's too large to upload to RStudio Cloud the usual way and read into the environment. Is there any way to use read_csv_chunked on files from my computer, or, alternatively, are there any good workarounds for this problem? Any help would be much appreciated! Thank you!
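For context, the general read_csv_chunked pattern I'm following looks something like this (the path, column names, filter condition, and chunk size below are just placeholders):

```r
library(readr)
library(dplyr)

# A callback that subsets each chunk as it's read; the kept rows from all
# chunks are stitched together into one data frame at the end.
keep_subset <- DataFrameCallback$new(function(chunk, pos) {
  filter(chunk, value > 100)   # placeholder filter condition
})

result <- read_csv_chunked("big_file.csv",      # placeholder path
                           callback   = keep_subset,
                           chunk_size = 100000)
```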
I actually did try that on the desktop version before read_csv_chunked! The real problem right now is that the file is on my computer and it's too large to upload to RStudio Cloud, so I'm wondering what a good way would be to get it into RStudio Cloud without uploading all of it, or something along those lines. Thank you for your response though!
Wow. So, transferring your data across your network will take time.
I suppose you can estimate how long by finding your upload speed somewhere like https://www.speedtest.net/.
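Quick back-of-the-envelope estimate (plug in your own numbers):

```r
# Rough upload-time estimate; the file size and speed below are placeholders --
# use your actual csv size and the upload speed speedtest reports.
file_gb     <- 20    # size of the csv in gigabytes
upload_mbps <- 10    # upload speed in megabits per second

hours <- file_gb * 8 * 1000 / upload_mbps / 3600  # GB -> gigabits -> megabits -> seconds -> hours
hours
#> about 4.4 hours with these example numbers
```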
You could probably compress/zip your file if you were committed to sending it. It might be worth testing on some number of chunks' worth of your data (zipped and unzipped) to see if the upload times differ significantly.
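One way to test that (untested sketch; the paths are placeholders, and note that write_csv compresses automatically when the output path ends in .gz, at least in recent readr versions):

```r
library(readr)

# Write the same sample of rows out twice -- plain and gzip-compressed --
# and compare the sizes you would actually have to upload.
sample_rows <- read_csv("big_file.csv", n_max = 100000)  # a few chunks' worth

write_csv(sample_rows, "sample.csv")
write_csv(sample_rows, "sample.csv.gz")

file.size("sample.csv") / file.size("sample.csv.gz")  # rough compression ratio
```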
Putting the network transfer challenge aside, if you wanted to try another way to access the large CSV data on your desktop, I would look at whether the mmap package, with its mmap.csv function, would help.
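Roughly the idea (untested sketch; mmap.csv is documented as experimental, so check ?mmap.csv and ?mmap for the exact arguments and extraction semantics before relying on this):

```r
# install.packages("mmap")
library(mmap)

# Memory-map the csv so the data stay on disk instead of being loaded into RAM.
m <- mmap.csv("big_file.csv", header = TRUE)   # placeholder path

# Pull out only the piece you need -- indexing here assumes the usual `[`
# extraction on the mapped object; see ?mmap for the exact rules.
first_block <- m[1:1000]

munmap(m)   # release the mapping when finished
```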
Hey, that's a good idea I think, thank you! I've never really done that before. Is there a way to read chunks at a time, subset, then move to the next chunk?
is there a way to read chunks at a time, subset, then move to the next chunk?
Do you mean from the database to R? I am no db expert (I do have a copy of SQL for Dummies), but I don't see why not. It should be more efficient to do the data selection (i.e. chunking) and subsetting in the database and import only the exact data you want to work with. I think it should reduce memory load and speed up processing time.
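Something like this is what I have in mind (sketch only; the SQLite file, table, and column names are placeholders, and you'd need the csv loaded into the database first):

```r
library(DBI)

# Connect to a local SQLite database that already contains the csv data
# in a table called "measurements" (both names are placeholders).
con <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")

# Do the chunking and subsetting in SQL so only the rows/columns you need
# ever reach R; step OFFSET forward to walk through the table in pieces.
subset_df <- dbGetQuery(con, "
  SELECT id, value
  FROM measurements
  WHERE value > 100
  LIMIT 100000 OFFSET 0
")

dbDisconnect(con)
```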
Check out the data.table package, as it's much more robust than readr at reading and processing data as large as yours. dplyr and readr are good, but for something that large your best bet is something backed by C, which will be much faster. Your alternative is SQL or a database-linking package, as suggested earlier.
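For example (the path and column names are placeholders, and this assumes the file has just these columns):

```r
library(data.table)

# fread's parser is written in C; 'select' reads only the named columns,
# which by itself can cut memory use a lot.
dt <- fread("big_file.csv", select = c("id", "value", "date"))

# Rough chunking is possible too with skip/nrows -- here rows 1,000,001 to
# 2,000,000. A numeric skip also skips the header line, so supply col.names.
chunk <- fread("big_file.csv",
               skip = 1e6 + 1, nrows = 1e6,
               col.names = c("id", "value", "date"))
```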