In this process, they first split the file into pieces of 50,000 records each, after creating an empty directory for the segments to be written into:
split -l 50000 data.json ./import/tweets_
I have tried following this process, but I keep getting the error "Error: unexpected numeric constant in "split -l 50000"". I have never come across 'split' before, nor do I understand what -l does. Would you care to explain?
Additionally, the next line of code given prints the headers:
head -1 import/tweets_da | grep -oP '"([a-zA-Z0-9\-_]+)"\:'
Again, I do not understand the -1 part, and I am fairly sure this returns the same error as above when I try it. I also do not understand where 'import/tweets_da' comes from. If anyone could explain what is going on here, it would be very helpful.
I have been trying for a long time to find a way to work with a 7GB ndjson file in R and have so far been unsuccessful. If the process I am pursuing is no good, I am open to other suggestions. For context, I am aiming to do some kind of textual analysis on Twitter posts.
The example treatment is not R code; it consists of shell commands run in a terminal with standard *nix utilities. That is also why R reports "unexpected numeric constant": typed at the R prompt, the R parser tries to read split -l 50000 ... as an R expression and fails.
The first command reads the file data.json and splits it into pieces of 50,000 lines each, writing them into the import subdirectory of the current directory as files whose names start with the prefix tweets_ followed by an automatically generated suffix (tweets_aa, tweets_ab, and so on).
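A minimal sketch of that step, assuming data.json sits in the current directory (the target directory must exist before split runs, which is presumably why an empty one was created first):

mkdir -p import
split -l 50000 data.json ./import/tweets_
ls import    # tweets_aa  tweets_ab  tweets_ac ...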
The second command operates on the file import/tweets_da, which is simply one of the pieces produced by split (da being one of the generated suffixes). head -1 extracts its first line, and the | operator pipes the result (much as %>% does in R) into a grep command. The -o flag prints only the matched text and -P enables Perl-compatible regular expressions; the pattern matches quoted field names, i.e. sequences of upper- and lower-case letters, digits, - and _ characters enclosed in double quotes and followed by a colon, which is how the JSON keys appear. Without more, this simply displays the matches on screen.
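For illustration, if the first line of import/tweets_da were a hypothetical record such as {"id":123,"text":"hello","user":"x"}, the pipeline would print each key on its own line:

head -1 import/tweets_da | grep -oP '"([a-zA-Z0-9\-_]+)"\:'
"id":
"text":
"user":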
For the reasons given in the linked post, preprocessing the data outside R may be required unless very large RAM resources are available. As noted, the data arrives as one JSON record per line. An alternative approach is to use awk, sed, or a custom parser written in flex/bison, C/C++, Go, Haskell, or another language that reads from stdin and writes to stdout; such tools go a long way toward surmounting the difficulties with large files because they act in a streaming fashion, holding only a line at a time in memory. A minimal sketch of the streaming idea follows.
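Assuming the tweets carry a "lang" field (a guess about this particular data; adapt the pattern to the keys the head/grep step actually reveals), this keeps only English-language records without ever loading the whole file:

# one JSON object per line, so a line count gives the record count
wc -l data.json
# stream through awk, keeping lines that contain "lang":"en"
awk '/"lang":"en"/' data.json > import/tweets_en.json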
If none of this makes sense, it is due to unfamiliarity with the Linux/macOS programming environment. That is well worth learning but isn't something that can be done in a few hours.
I'm on Windows, so I downloaded Cygwin and managed to run the split command there (I just had to place my dataset in the working directory). Much easier than anticipated; this is definitely the easiest way I have come across to import large ndjson files into R.