In this process, they first split the file into pieces of 50,000 records each, after creating an empty directory for the segments to be written into:
split -l 50000 data.json ./import/tweets_
I have tried following this process, but I keep getting the error "Error: unexpected numeric constant in "split -l 50000"". I have never come across 'split' before, nor do I understand what -l does. Would you care to explain?
Additionally, the next line of code given prints the headers:
head -1 import/tweets_da | grep -oP '"([a-zA-Z0-9\-_]+)"\:'
Again, I do not understand the -1 part, and I am fairly sure this returns the same error as above when I try it. I also do not understand where 'import/tweets_da' comes from. If anyone could explain what is going on here, it would be very helpful.
I have been trying for a long time to find a way to work with a 7GB ndjson file in R and have so far been unsuccessful. If the process I am pursuing is no good, I am open to other suggestions. For context, I am aiming to do some kind of textual analysis on Twitter posts.
The example treatment is not R code; it consists of shell commands run in a terminal with standard *nix utilities. That is also why R reports "unexpected numeric constant": typed at the R prompt, the R parser tries to read split -l 50000 ... as an R expression and fails.
The first command reads the file data.json and splits it into pieces of 50,000 lines each, writing them into the import subdirectory of the current directory as files whose names start with the prefix tweets_ followed by an automatically generated suffix (tweets_aa, tweets_ab, and so on).
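A minimal sketch of that step, assuming data.json sits in the current directory (the target directory must exist before split runs, which is presumably why an empty one was created first):

mkdir -p import
split -l 50000 data.json ./import/tweets_
ls import    # tweets_aa  tweets_ab  tweets_ac ...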
The second command operates on the file import/tweets_da, which is simply one of the pieces produced by split (da being one of the generated suffixes). head -1 extracts its first line, and the | operator pipes the result (much as %>% does in R) into a grep command. The -o flag prints only the matched text and -P enables Perl-compatible regular expressions; the pattern matches quoted field names, i.e. sequences of upper- and lower-case letters, digits, - and _ characters enclosed in double quotes and followed by a colon, which is how the JSON keys appear. Without more, this simply displays the matches on screen.
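For illustration, if the first line of import/tweets_da were a hypothetical record such as {"id":123,"text":"hello","user":"x"}, the pipeline would print each key on its own line:

head -1 import/tweets_da | grep -oP '"([a-zA-Z0-9\-_]+)"\:'
"id":
"text":
"user":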
For the reasons given in the linked post, preprocessing the data outside R may be required unless very large RAM resources are available. As noted, the data arrives as one JSON record per line. An alternative approach is to use awk, sed, or a custom parser written in flex/bison, C/C++, Go, Haskell, or another language that reads from stdin and writes to stdout; such tools go a long way toward surmounting the difficulties with large files because they act in a streaming fashion, holding only a line at a time in memory. A minimal sketch of the streaming idea follows.
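Assuming the tweets carry a "lang" field (a guess about this particular data; adapt the pattern to the keys the head/grep step actually reveals), this keeps only English-language records without ever loading the whole file:

# one JSON object per line, so a line count gives the record count
wc -l data.json
# stream through awk, keeping lines that contain "lang":"en"
awk '/"lang":"en"/' data.json > import/tweets_en.json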
If none of this makes sense, it is due to unfamiliarity with the Linux/macOS programming environment. That is well worth learning but isn't something that can be done in a few hours.
I'm on Windows, so I downloaded Cygwin and managed to run the split command there (I just had to place my dataset in the working directory). Much easier than anticipated; this is definitely the easiest way I have come across to import large ndjson files into R.