Stream 500 GB Compressed JSON (gz) Without Downloading File?

DarkHelmet · July 12, 2022, 1:24pm

There is a 500 GB compressed JSON file (.gz) I'd like to analyze. Is there anyway to use R, such as with jsonlite, to read in only parts of the file? Or do I have to download the entire file and decompress?

Here's the link:
https://d25kgz5rikkq4n.cloudfront.net/cost_transparency/mrf/in-network-rates/reporting_month=2022-07/2022-07-01_cigna-health-life-insurance-company_national-oap_in-network-rates.json.gz?Expires=1659190085&Signature=GQUiH7SrLCSVd0u9qXyAuvpqlFcbrs96gUF8bo~y5V3RmHQjiEFKyHo-5-p3lhTKFOvDchwUvzwWACmMHH~dpRW51RzYfVTj5e4WbC9UHLtIwqViL2rRxMWRm0oQi0lQOaJk-3~ecJMTIFQIv7zrqmel0IauazF49G2FjEj3vQfJMzgOirWyIuOPpTJr12MgtsdSr7-7N-hhyqbO6i3ePssS72q6TKyD0QMXxB2csQLZwTnnxp-dEJvULvb3n8x9XYGQ2TtAlMXZMGoqukVuY8jD17UrMqd9BP~g35BZYTdyANkIJkp2-1WAL5Dvqevm4HptXcmE25BysLb0UW6xAg__&Key-Pair-Id=K1NVBEPVH9LWJP

AlexisW · July 13, 2022, 1:50am

Does your computer have several 100 GB RAM? With a cursory look, most of the memory in the JSON is used by actual data (not just boilerplate), so, even if you streamed it, I don't see how you could keep the contents of that file in memory.

Assuming you are working with a desktop computer, I see two possibilities: either you're trying to recover a few specific values (so you could just stream, picking up what's needed while discarding the rest), or you want to have access to the entire dataset.

Entire dataset

In the second case, I think the easiest is to use a database like sqlite or duckdb. It will require downloading the entire file and storing it to disk, but then you can use pointed commands to extract anything you need. You can find many ways to do it here.

Extracting values on-the-fly

It seems to me that file is not in a format like ndjson where each line can be interpreted independently. So, yes, you can read parts of the file, but parsing the json might be a problem since you get unmatched { or }.

If the data is in ndjson, there is jsonlite::stream_in(). But when I try it on this link, I get an error similar to that one.

An alternative is to use scan() to directly read the text:

url <- "https://d25kgz5rikkq4n.cloudfront.net/cost_tra..."


mytext <- scan(gzcon(url(url)),
           what = character(),
           sep = "\n",
           skip = 1L,
           n = 2L)
nchar(mytext)
#> [1] 1124838   98110

substr(mytext[[1]], 1, 9)
#> [1] "{\"negotia"

myjson <- jsonlite::fromJSON(mytext[[2]])
#> Error: parse error: trailing garbage
#>           ditional_information":""}]}]},
#>                      (right here) ------^

^{Created on 2022-07-12 by the reprex package (v2.0.1)}

But you still have a problem when trying to parse the JSON line-by-line. However if you know what you're looking for, you can use sep = "," and look at each json record while it passes by, and only save it if needed. That would take some programming.

DarkHelmet · July 13, 2022, 9:08am

Thanks for the insights AlexisW. First I will concede that I am not very familiar with JSON in general, let alone with a file of this size (perhaps I should be first starting with a smaller one).

No, I do not have 100 GB's of RAM, just a regular laptop with 16 GB of RAM. I don't really need the entire data set, I am interested in a small subset of records. But I don't even know the column names so first just need a sample.

What is the significance of the data not being boilerplate? Again, my ignorance on JSON basics.

Thanks for the suggestion on sqlite or duckdb and associated stackoverflow link. Just curious, why those particular databases? Would postgres work too?

How would I "look at each json record while it passes by"? Can I effectively loop through the JSON file record at a time? Any examples on how to do that? Pretty sure I could implement that once I saw an example of something similar.

AlexisW · July 13, 2022, 1:37pm

It's just that, if the JSON looked like that: {{"1"},{"2"}},{{{"1"},{"2"}},{"1"}} there you have 35 characters, so 35 B of text, but it only contains 5 B of data, the rest is just { and ". So once the JSON is read and saved in memory, it takes less space. This is not the case in your file.

Yes, works perfectly if you have a server running. These two DB have the advantage that a database is just a file on disk. It's easier to use if you don't already have postgres installed.

Here is a starting point:

url <- "https://d25kgz5rikkq4n.cloudfront.net/cost_transp..."

cur_line <- 1
cur_line <- cur_line + 1

# get current line
mytext <- scan(gzcon(url(url)),
           what = character(),
           sep = "\n",
           skip = cur_line - 1,
           n = 1L)

# nb of characters in that line
nchar(mytext)

substr(mytext[[1]], 1, 100)

individual_fields <- strsplit(mytext, ",")[[1]]

length(individual_fields)
nchar(individual_fields) |> head()


# extract key-value
keyvals <- stringr::str_split_fixed(individual_fields, ":", n=2) |>
  as.data.frame() |>
  setNames(c("key", "value"))


head(keyvals)

keyvals$value[keyvals$key == '"billing_code"']

You can replace the cur_line <- 1 by a while loop, with the appropriate stopping condition.

This example is very simple as in I'm reading each line one-by-one, ignoring all the { and } structure, and looking for key-values pairs. Then it's relatively easy to look for one particular key. Depending on what you're looking for, you might need to keep track of the number of open { etc.

If you don't know much about the structure of the file, it's a hard task. You probably should start with manual exploration to get a "feeling" of what the fields are in the file, and how they're organized. Ideally there should be some kind of documentation or manual distributed with the file to give you an idea of what the columns mean and what to expect in there.

jeffe · July 28, 2022, 7:47pm

Looking to better understand this same data-set. Have you had any success you could share?
Best

DarkHelmet · July 28, 2022, 8:33pm

I have not been able to do this as gracefully as I would have hoped, however perhaps my expectations at the onset were unrealistic.

What I have done is:

Downloaded ~ 1 GB of the file via PowerShell
Extracted locally using 7-zip
Utilized a very basic R script to save the first few JSON lines to a small text file
Ingested a single JSON line from #3 and converted to nested lists using rjson::fromJSON
Wrote a series of nested for loops normalizing into a data frame

Pending availability of time, I plan on incrementally expanding the logic of #5 across larger and larger portions of my downloaded subset per #1. If I can get through all of #1 I will buy a 10+ TB external hard drive to see if I can broaden this other files / sources.

system · August 18, 2022, 8:33pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.