I think reproducibility is challenging to achieve for students and researchers with limited access to large data storage, and it is an important topic to discuss before jumping into a real-world work setting.
So I wanted to create a thread here to ask some questions. I also asked for help on the discussion board for my data science program, but I haven't received any response so far, so I would like to get some professional advice here on RStudio Community.
Thank you in advance.
I am trying to figure out the code to automatically download and unzip a large dataset from Kaggle.com. My goal is either to upload the file to my GitHub repo or to write code that lets others download the data easily. I have tried three ways.
First, the code below specifies where the zip file lives on the Kaggle website, creates a temp file, downloads the archive into it, and uses unzip() and read.csv() to access the data file.
# Create a temp file, download the Kaggle archive into it, then read the CSV
temp <- tempfile(fileext = ".zip")
url <- "https://www.kaggle.com/......../download/archive.zip"
download.file(url, temp, mode = "wb")  # mode = "wb" keeps the zip binary intact on Windows
steam_main_data <- read.csv(unzip(temp, files = "steam_reviews.csv"))  # read.csv, since the file is comma-separated
unlink(temp)
This is failing.
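I suspect the root cause is that Kaggle requires a logged-in session or an API token to download datasets, so download.file() likely receives an HTML login page instead of the zip. For reference, here is an untested sketch of the authenticated route using the official Kaggle CLI; "<owner>/<dataset>" is a placeholder for the real dataset slug:

# Untested sketch: assumes the Kaggle CLI is installed (pip install kaggle)
# and an API token is stored in ~/.kaggle/kaggle.json
system2("kaggle", c("datasets", "download",
                    "-d", "<owner>/<dataset>",  # placeholder for the real slug
                    "-p", tempdir(), "--unzip"))
steam_main_data <- read.csv(file.path(tempdir(), "steam_reviews.csv"))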
Second, I used the pins package: I uploaded the data file to OneDrive and Dropbox, registered it as a pin board, and tried to read the file from there.
library(pins)

# Register the Dropbox share link as a URL-based pin board
steam_board <- board_url(c(
  "steam_review_board" = "https://www.dropbox.com/........../steam_reviews.csv?dl=0"
))
steam_main_data_2 <- steam_board %>%
  pin_read("steam_review_board") %>%
  as.data.frame()
This is also failing.
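As far as I understand the pins documentation, board_url() can serve raw files, but pin_read() only works for pins written with pins metadata, so pin_download() may be the right call for a plain CSV. Here is an untested sketch of that variant, which also switches the Dropbox link from ?dl=0 to ?dl=1 so it points at the raw file instead of the preview page:

library(pins)

# Untested sketch: fetch the raw CSV with pin_download() and parse it locally
steam_board <- board_url(c(
  "steam_review_board" = "https://www.dropbox.com/........../steam_reviews.csv?dl=1"  # dl=1 forces a direct download
))
csv_path <- pin_download(steam_board, "steam_review_board")
steam_main_data_2 <- read.csv(csv_path)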
Third, I used the piggyback package to upload a file directly from my local file system to a GitHub repo as a release asset; this is the approach I most want to work.
library(piggyback)

# Upload the local CSV as an asset of the GitHub release tagged "v0.0.1"
pb_upload("C:/Users/...../...../...../steam_reviews.csv",
          repo = "my_repo",  # note: piggyback expects the "owner/repo" form here
          tag = "v0.0.1")
When I run this code, it looks like it is working at first, but the upload progress stays at 1% the whole time and never advances, so this is failing as well.
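I also wonder whether the file is simply too large, since GitHub limits release assets to 2 GB each. Beyond that, two things seem worth checking: that the release tagged "v0.0.1" actually exists before uploading (newer piggyback versions create one with pb_release_create(); older ones call it pb_new_release()), and how others would pull the file back down with pb_download(). Here is an untested sketch of both, where "user/my_repo" is a placeholder for the real "owner/repo" slug and a GitHub token is assumed to be set in the GITHUB_TOKEN environment variable:

library(piggyback)

# Untested sketch; "user/my_repo" is a placeholder for the real "owner/repo"
# and a GitHub token is assumed to be available in GITHUB_TOKEN
pb_release_create(repo = "user/my_repo", tag = "v0.0.1")  # create the release once

pb_upload("C:/Users/...../...../...../steam_reviews.csv",
          repo = "user/my_repo",
          tag = "v0.0.1")

# Anyone else could then fetch the file with:
pb_download("steam_reviews.csv",
            repo = "user/my_repo",
            tag = "v0.0.1",
            dest = ".")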
I would like to know about any other ways to solve this issue.
Thank you again.