I have a file that is too large to read into R all at once, so I've been using the read_csv_chunked function to read it in chunks. I had been working in the desktop version of RStudio, but even then the job was still running after 3–4 days, so my advisor set me up with a Google Cloud compute instance to try to get it done without tying up my laptop. The only problem is that the file (a CSV) is on my computer, and it's too large to upload to RStudio Cloud the usual way and read into the environment. Is there any way to use read_csv_chunked on files from my computer, or, alternatively, are there any good workarounds for this problem? Any help would be much appreciated! Thank you!
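For context, the general read_csv_chunked pattern I'm following looks something like this (the path, column names, filter condition, and chunk size below are just placeholders):

```r
library(readr)
library(dplyr)

# A callback that subsets each chunk as it's read; the kept rows from all
# chunks are stitched together into one data frame at the end.
keep_subset <- DataFrameCallback$new(function(chunk, pos) {
  filter(chunk, value > 100)   # placeholder filter condition
})

result <- read_csv_chunked("big_file.csv",      # placeholder path
                           callback   = keep_subset,
                           chunk_size = 100000)
```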
I actually did try that on the desktop version before read_csv_chunked! The real problem right now is that the file is on my computer and it's too large to upload to RStudio Cloud, so I'm wondering what a good way would be to get it into RStudio Cloud without uploading all of it, or something along those lines. Thank you for your response though!
Wow. So, transferring your data across your network will take time.
I suppose you can estimate how long by finding your upload speed somewhere like https://www.speedtest.net/.
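Quick back-of-the-envelope estimate (plug in your own numbers):

```r
# Rough upload-time estimate; the file size and speed below are placeholders --
# use your actual csv size and the upload speed speedtest reports.
file_gb     <- 20    # size of the csv in gigabytes
upload_mbps <- 10    # upload speed in megabits per second

hours <- file_gb * 8 * 1000 / upload_mbps / 3600  # GB -> gigabits -> megabits -> seconds -> hours
hours
#> about 4.4 hours with these example numbers
```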
You could probably compress/zip your file if you were committed to sending it. It might be worth testing on some number of chunks' worth of your data (zipped and unzipped) to see if the upload times differ significantly.
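One way to test that (untested sketch; the paths are placeholders, and note that write_csv compresses automatically when the output path ends in .gz, at least in recent readr versions):

```r
library(readr)

# Write the same sample of rows out twice -- plain and gzip-compressed --
# and compare the sizes you would actually have to upload.
sample_rows <- read_csv("big_file.csv", n_max = 100000)  # a few chunks' worth

write_csv(sample_rows, "sample.csv")
write_csv(sample_rows, "sample.csv.gz")

file.size("sample.csv") / file.size("sample.csv.gz")  # rough compression ratio
```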
Putting the network transfer challenge aside, if you wanted to try another way to access the large CSV data on your desktop, I would look at whether the mmap package, with its mmap.csv function, would help.
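Roughly the idea (untested sketch; mmap.csv is documented as experimental, so check ?mmap.csv and ?mmap for the exact arguments and extraction semantics before relying on this):

```r
# install.packages("mmap")
library(mmap)

# Memory-map the csv so the data stay on disk instead of being loaded into RAM.
m <- mmap.csv("big_file.csv", header = TRUE)   # placeholder path

# Pull out only the piece you need -- indexing here assumes the usual `[`
# extraction on the mapped object; see ?mmap for the exact rules.
first_block <- m[1:1000]

munmap(m)   # release the mapping when finished
```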
Hey, that's a good idea I think, thank you! I've never really done that before. Is there a way to read chunks at a time, subset, then move to the next chunk?
is there a way to read chunks at a time, subset, then move to the next chunk?
Do you mean from the database to R? I am no db expert (I do have a copy of SQL for Dummies), but I don't see why not. It should be more efficient to do the data selection (i.e. chunking) and subsetting in the database and import only the exact data you want to work with. I think it should reduce memory load and speed up processing time.
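Something like this is what I have in mind (sketch only; the SQLite file, table, and column names are placeholders, and you'd need the csv loaded into the database first):

```r
library(DBI)

# Connect to a local SQLite database that already contains the csv data
# in a table called "measurements" (both names are placeholders).
con <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")

# Do the chunking and subsetting in SQL so only the rows/columns you need
# ever reach R; step OFFSET forward to walk through the table in pieces.
subset_df <- dbGetQuery(con, "
  SELECT id, value
  FROM measurements
  WHERE value > 100
  LIMIT 100000 OFFSET 0
")

dbDisconnect(con)
```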
Check out the data.table package, as it's much more robust than readr at reading and processing data as large as yours. dplyr and readr are good, but for something that large your best bet is something backed by C, which will be much faster. Your alternative is SQL or a database-linking package, as suggested earlier.
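For example (the path and column names are placeholders, and this assumes the file has just these columns):

```r
library(data.table)

# fread's parser is written in C; 'select' reads only the named columns,
# which by itself can cut memory use a lot.
dt <- fread("big_file.csv", select = c("id", "value", "date"))

# Rough chunking is possible too with skip/nrows -- here rows 1,000,001 to
# 2,000,000. A numeric skip also skips the header line, so supply col.names.
chunk <- fread("big_file.csv",
               skip = 1e6 + 1, nrows = 1e6,
               col.names = c("id", "value", "date"))
```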