To download this manually requires a Kaggle account, which I'm guessing is part of the friction for batches of new users, so the goal is to relieve that through a facility that requires only simple authentication.
One solution is to delegate the API call to a proxy server. Once the user authenticates to the proxy (by user:password or by domain, such as my_univ.edu), the proxy makes the API query on the user's behalf.
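As a sketch of that delegation, here is what a minimal {plumber} endpoint on the proxy might look like. Everything here is illustrative, not a tested deployment: the token check stands in for whatever real authentication the institution uses, and fetch_from_kaggle() is a hypothetical helper that would wrap the Kaggle API call using credentials stored only on the server.

```r
# Minimal {plumber} proxy sketch: the server holds the Kaggle
# credentials; authenticated users only ever hit this endpoint.
library(plumber)

#* Toy filter: check that the request carries an institutional token.
#* A real deployment would use the institution's SSO instead.
#* @filter auth
function(req, res) {
  if (!identical(req$HTTP_X_PROXY_TOKEN, Sys.getenv("PROXY_TOKEN"))) {
    res$status <- 401
    return(list(error = "not authorized"))
  }
  plumber::forward()
}

#* Download a dataset on the caller's behalf.
#* @param owner Kaggle dataset owner, e.g. "yamqwe"
#* @param dataset dataset slug
#* @serializer contentType list(type = "application/zip")
#* @get /dataset
function(owner, dataset) {
  # fetch_from_kaggle() is hypothetical: it would make the Kaggle API
  # call with the server-side key and return a path to the zip file.
  path <- fetch_from_kaggle(owner, dataset)
  readBin(path, "raw", n = file.info(path)$size)
}
```

The point of the design is that the Kaggle key never leaves the proxy; the user's only credential is whatever the institution already issues.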
There's an R API client for Kaggle that could be used. Take Kaggle: Your Home for Data Science as an example; the dataset there is a small file named archive.zip. After the API client authenticates to Kaggle, kaggle::kgl_datasets_download("yamqwe", "download/archive.zip") would download it, after which it could be unzipped and piped through readr::read_csv() to be served.
This keeps the data where it belongs, at the source; it dodges GitHub's size limit for a single file (about 50MB); and it eliminates the problems of adding new datasets, updating old ones, or deleting any that have been withdrawn for some reason.
How to broker the handoff from the proxy to the user? That's a use case for RStudio Server, which has built-in user authentication and can run the query and download steps on the server, returning the results to the RStudio client window on the user's end.
Part of the solution is provided by the API, which allows programmatic access to Kaggle's data store, but on its own that only trades a web username:password obstacle for an API-key obstacle.
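That obstacle disappears for end users if the API key lives only in the server account's environment, set once by the administrator. A sketch, with placeholder values (the kgl_auth() call at the end is the {kaggler}-style entry point and its exact arguments should be checked against that package):

```r
# In the server account's ~/.Renviron, set once by the admin
# (placeholder values shown):
#   KAGGLE_USERNAME=proxy_account
#   KAGGLE_KEY=xxxxxxxxxxxxxxxx

# Any R session on the server can then authenticate without the
# user ever seeing or handling the key.
kaggle_user <- Sys.getenv("KAGGLE_USERNAME")
kaggle_key  <- Sys.getenv("KAGGLE_KEY")
stopifnot(nzchar(kaggle_user), nzchar(kaggle_key))

# ...pass these to the API client,
# e.g. kaggler::kgl_auth(username = kaggle_user, key = kaggle_key)
```

This way the only password the user ever types is the one RStudio Server already asks for.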