Reading gzipped file from remote connection

cbrnr · August 18, 2023, 9:17am

I know that readr::read_csv() supports reading gzipped files, but for some reason, I get an error with the following URL:

https://ec.europa.eu/eurostat/databrowser-backend/api/extraction/1.0/LIVE/true/sdmx/csv/PRC_HICP_MIDX?i&compressed=true

Opening this URL in a browser downloads a (large) gzipped file called PRC_HICP_MIDX (no extension), which is a gzipped text (CSV) file.

I can load it fine with

readr::read_csv("PRC_HICP_MIDX")

However, I get an error when trying to read it remotely with

readr::read_csv("https://ec.europa.eu/eurostat/databrowser-backend/api/extraction/1.0/LIVE/true/sdmx/csv/PRC_HICP_MIDX?i&compressed=true")

Error in vroom_(file, delim = delim %||% col_types$delim, col_names = col_names,  :                                                        
  embedded nul in string: '\037\x8b\b\0\0\0\0\0\0\0\xac\xbd͎.\xbd\x92^77\xe0{\xd0\xd0\006ʥ\x97Lf&\xa9ّ\xfa\xa8\xd5\xd0\021\xd4\xd0\xe9\xf6\x8f&=\020dC\023\xcb\026\xec\xfbw2"\x9e\xc8*\017\xbf\025\xc0\036\xd4`c!\x93/\x83d\x92\xc1\025\177\xf3\xa7\177\xf8ӿ\xfe˿\xff_\xbe\xfe\xf2\xa7\xbf\xfe\xc3?\xfbǿ\xff\x9b?\xfdß\xbf\xfe\xf7\xff\xf6\x9f\xff\xef\xaf\xff\xf7\xff\xfc/\xff\xcf\xd7\177\xfa\xaf\xff\xe5?\xfd\xd7\xff\xeb\xeb\xff\xf8\xcf\xff\xf5\xeb\037\xfe\xee\xdf\xfd\xf9\x9f\xfe\xfe\xcf\xff\xe1\xef\xfe\xfd\xdf|\xfd\xfb\177\xf9\xd7\177\xfa\x9f\xff\xf4\x97\177\xfc\xb3\xfd\xf5\xaf\xff\xf2\xa7\xbf\xfd\xef\xff\xbb?\xff\xf5\037\xfe\xf4\017\xff\xe2\xef\xffÿ\xfa\xa7\177\xf3w\xff\xea\xef\xff\xe9\xdf\xfd\xdd\xdf\xfc\xaf\xffC\xfb\xfe\xfc\x8f_m\xfe\xf3\xcf\xfc\xe7\xfd\xf8g\xad\xfd\x8b\xcf\xe7\xf9\xf7\xf5\xef\xbe\xfe\xees~\xfd\xab\xbf\177\xfe\xfc\xd3?|\xb5\xb5\xae\xff\xe9Ӿ\xe6\xf5}\xcd/\f\xea\033\xb4\006\a\035_\xf3\xfe\xee\037\016\032\033\xd4\032\a\x9d\033\xf4<\027\006]\xf6j\x8b\x83\xee\r:'

Is loading from that URL supposed to work, and if so, why does it produce an error?

AlexisW · August 18, 2023, 2:55pm

I think what's happening is that {readr} first looks at the argument you provide. It notices that it starts with "https://", so it's a URL and needs downloading. Then readr looks at the extension and doesn't find any (no ".gz" at the end), so it assumes it's plain text, downloads it as text, and you end up with unusable content.

So in your case, you need to tell readr explicitly that what you want to download is a gzip file:


my_url <- "https://ec.europa.eu/eurostat/databrowser-backend/api/extraction/1.0/LIVE/true/sdmx/csv/PRC_HICP_MIDX?i&compressed=true"

readr::read_csv(gzcon(url(my_url)))
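
To try the same gzcon() pattern without hitting the network, here is a minimal base-R sketch. The sample rows and the temp file are made up for illustration. It goes through readLines() because base read.csv() needs a text-mode connection; passing the gzcon() connection straight to readr::read_csv(), as above, is what worked in this thread.

```r
# Create a small gzipped CSV with no file extension (made-up sample data)
path <- tempfile()
out <- gzfile(path, "w")
writeLines(c("geo,value", "AT,101.5", "BE,99.8"), out)
close(out)

# Open the compressed bytes as-is, then let gzcon() decompress on the fly
con <- gzcon(file(path, "rb"))
lines <- readLines(con)
close(con)

# Parse the decompressed text
dat <- read.csv(text = lines)
print(dat)
```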

cbrnr · August 18, 2023, 4:06pm

Thank you so much, this works! Quick follow-up question, why does readr::read_csv() know that the local file (without any extension) is a gzip file? What is the technical reason that this does not work for remote connections?

AlexisW · August 18, 2023, 4:23pm

You'd want to read the source code of read_csv() to know for sure. My guess is that it uses something like file("PRC_HICP_MIDX") to get details, which in turn works a bit like Linux's file command and uses various approaches (including reading the header at the beginning of the binary file) to guess the file type.

When you give a URL, there is no direct access to the file content, so readr makes a guess purely based on the URL. If it were an important enough use case, I imagine readr could download the beginning of the file, look at it to guess the type, discard it, and then download again the right way. I imagine that would come with its own problems, and it is probably uncommon enough that they didn't implement it.
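
One way around guessing from the URL is to download the bytes to a temporary file first and read the local copy. A rough sketch in base R, under some assumptions: read_remote_gz_csv() is a hypothetical helper name, not a readr function, and gzfile() is used because it reads both gzipped and uncompressed files.

```r
# Hypothetical helper: download first, then read the local copy
read_remote_gz_csv <- function(url) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  # mode = "wb" keeps the compressed bytes intact on every platform
  utils::download.file(url, tmp, mode = "wb", quiet = TRUE)
  # gzfile() decompresses gzip input and passes plain text through unchanged
  utils::read.csv(gzfile(tmp))
}

# Usage (needs network access):
# dat <- read_remote_gz_csv(my_url)
```

With readr installed, readr::read_csv(tmp) could probably replace the last line of the helper, since reading the local file already worked in the original post.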

AlexisW · August 18, 2023, 4:33pm

Specifically, in a Linux terminal, you can look at the binary file with:

$ hexdump PRC_HICP_MIDX2 | head
0000000 8b1f 0008 0000 0000 0000 bdac 8ecd bd2e
0000010 5e92 3737 7be0 d0d0 ca06 97a5 664c a926

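Notice the first two bytes are 1f 8b (hexdump's default output prints them as the little-endian 16-bit word 8b1f), which is the so-called "magic number" for gzip as you can see in this list. They also match the '\037\x8b' at the start of the string in your error message.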
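
The same check can be done from R. A minimal sketch, assuming the gzip magic number is the bytes 0x1f 0x8b in file order; is_gzip() is a hypothetical helper and the two sample files are made up:

```r
# Hypothetical helper: TRUE if the file starts with the gzip magic bytes
is_gzip <- function(path) {
  # readBin() on a path reads the raw, still-compressed bytes
  magic <- readBin(path, what = "raw", n = 2L)
  identical(magic, as.raw(c(0x1f, 0x8b)))
}

# Made-up sample files: one gzipped, one plain text
gz_path <- tempfile()
out <- gzfile(gz_path, "w")
writeLines("geo,value", out)
close(out)

txt_path <- tempfile()
writeLines("geo,value", txt_path)

is_gzip(gz_path)
is_gzip(txt_path)
```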

cbrnr · August 18, 2023, 4:58pm

This makes perfect sense, thank you for your help!
