I have a long list of links, each one for a NetCDF file. If I put a link in my browser, a file automatically starts downloading, but the browser doesn't go anywhere.
What are these: links or files? How do I read them in R?
When I try RCurl::getURL(), I get
Error in nc_open trying to open file <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://urs.earthdata.nasa.gov/oauth/authorize/... etc.
I have all the links in a file called "myfiles.dat". I'm hoping to move ahead and learn purrr with this set.
If I copy and paste one of the links into my browser, a file still downloads.
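For reference, once one of these files is actually on disk (e.g. downloaded through the browser), reading it in R is straightforward with the ncdf4 package. A minimal sketch; the file and variable names below are taken from the example link later in the thread, so adjust them to whatever actually downloads:

library(ncdf4)

nc <- nc_open("MERRA300.prod.simul.tavg1_2d_mld_Nx.20040101.SUB.nc4")
print(nc)                          # lists the variables and dimensions
tsoil1 <- ncvar_get(nc, "tsoil1")  # pull one variable into an R array
nc_close(nc)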
@alistaire, you're right. I do need to log in to get this information. Now I understand that RCurl won't work without the login. They have some short instructions on how to download all the files using Unix. It looks like I will probably have to contact them to see if there is a way around it, since I am not a Unix user.
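For what it's worth, the Unix (wget) instructions they give usually translate fairly directly to httr. A rough sketch, assuming you have an Earthdata account, a ~/.netrc file containing a line like "machine urs.earthdata.nasa.gov login <user> password <pass>", and one link per line in myfiles.dat (the paths and output file name here are placeholders):

library(httr)

# Tell curl to follow the redirect to urs.earthdata.nasa.gov, authenticate
# there with the .netrc credentials, and keep the session cookie it hands back
set_config(config(
  followlocation = 1,
  netrc          = 1,
  netrc_file     = "~/.netrc",
  cookiefile     = "~/.urs_cookies",
  cookiejar      = "~/.urs_cookies"
))

links <- readLines("myfiles.dat")

# Try one file first
resp <- GET(links[1], write_disk("test_subset.nc4", overwrite = TRUE))
stop_for_status(resp)  # an error here usually means the login didn't go through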
NetCDFs are popular with climate scientists and almost nobody else. If you need advice on getting started with whichever product this is (e.g. accessing it), I can ask around the office and see if someone's used it before.
r <- httr::GET("http://goldsmr2.gesdisc.eosdis.nasa.gov/daac-bin/OTF/HTTP_services.cgi?FILENAME=%2Fdata%2FMERRA%2FMST1NXMLD.5.2.0%2F2004%2F01%2FMERRA300.prod.simul.tavg1_2d_mld_Nx.20040101.hdf&FORMAT=bmM0Lw&BBOX=45.687%2C-95.804%2C45.694%2C-95.794&LABEL=MERRA300.prod.simul.tavg1_2d_mld_Nx.20040101.SUB.nc4&SHORTNAME=MST1NXMLD&SERVICE=SUBSET_MERRA&VERSION=1.02&LAYERS=&VARIABLES=tsoil1")
r
#> Response [https://urs.earthdata.nasa.gov/oauth/authorize/?scope=uid&app_type=401&client_id=e2WVk8Pw6weeLUKZYOxvTQ&response_type=code&redirect_uri=http%3A%2F%2Fgoldsmr2.gesdisc.eosdis.nasa.gov%2Fdata-redirect&state=aHR0cHM6Ly9nb2xkc21yMi5nZXNkaXNjLmVvc2Rpcy5uYXNhLmdvdi9kYWFjLWJpbi9PVEYvSFRUUF9zZXJ2aWNlcy5jZ2k%2FRklMRU5BTUU9JTJGZGF0YSUyRk1FUlJBJTJGTVNUMU5YTUxELjUuMi4wJTJGMjAwNCUyRjAxJTJGTUVSUkEzMDAucHJvZC5zaW11bC50YXZnMV8yZF9tbGRfTnguMjAwNDAxMDEuaGRmJkZPUk1BVD1ibU0wTHcmQkJPWD00NS42ODclMkMtOTUuODA0JTJDNDUuNjk0JTJDLTk1Ljc5NCZMQUJFTD1NRVJSQTMwMC5wcm9kLnNpbXVsLnRhdmcxXzJkX21sZF9OeC4yMDA0MDEwMS5TVUIubmM0JlNIT1JUTkFNRT1NU1QxTlhNTEQmU0VSVklDRT1TVUJTRVRfTUVSUkEmVkVSU0lPTj0xLjAyJkxBWUVSUz0mVkFSSUFCTEVTPXRzb2lsMQ]
#> Date: 2018-01-18 23:36
#> Status: 401
#> Content-Type: text/html; charset=utf-8
#> Size: 27 B
#> HTTP Basic: Access denied.
This suggests that you've logged into the site in your browser and it's probably using cookies to remember you.
It might be possible to automate the log-in and download process with rvest, but if you haven't done any web scraping before it's going to be quite a lot of work (and I don't think there's a single good resource where you can learn the basics).
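If an authenticated download does work (e.g. with the httr/.netrc configuration sketched earlier), the purrr piece is short. A sketch, assuming myfiles.dat has one link per line and that each link carries its output file name in a LABEL= parameter, as in the example request above:

library(httr)
library(purrr)
library(ncdf4)

links <- readLines("myfiles.dat")

# Reuse the file name embedded in each link's LABEL= parameter
destfiles <- sub(".*LABEL=([^&]+).*", "\\1", links)

# Download every file; walk2() is used because we only want the side effect
walk2(links, destfiles, ~ GET(.x, write_disk(.y, overwrite = TRUE)))

# Then read each one, e.g. pulling the tsoil1 variable out of every file
tsoil1_list <- map(destfiles, function(f) {
  nc <- nc_open(f)
  on.exit(nc_close(nc))
  ncvar_get(nc, "tsoil1")
})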