I have a large JSON file (around 80 MB) and I want to convert it to CSV so that I can work with it in R.
It is a news dataset, and my primary task is to sort the data into categories by identifying keywords in the news headlines.
I'd suggest splitting your problem into steps. The first step is to read the data from JSON into R, which you can do with the jsonlite package.
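As a minimal sketch of that first step, assuming the file is a JSON array of flat records (the field names `headline` and `date` below are made up for illustration), `jsonlite::fromJSON()` returns a data frame that `write.csv()` can then save:

```r
library(jsonlite)

# Toy stand-in for the real file; the field names are hypothetical.
json_text <- '[
  {"headline": "Council approves new budget", "date": "2017-01-03"},
  {"headline": "Storm closes coastal roads", "date": "2017-01-04"}
]'
json_path <- tempfile(fileext = ".json")
writeLines(json_text, json_path)

# An array of flat objects is simplified to a data frame by default.
news <- fromJSON(json_path)

# Save as CSV so it can be reloaded later with read.csv().
csv_path <- tempfile(fileext = ".csv")
write.csv(news, csv_path, row.names = FALSE)
```

If the records are nested, `jsonlite::flatten()` may be needed before writing the CSV.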
The second step is the text mining itself (note, however, that this step has nothing to do with reading JSON). For this, you can look through the CRAN task view (https://cran.r-project.org/web/views/NaturalLanguageProcessing.html) or take a look at the tidytext package. Perhaps it has a vignette that does something similar.
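Before reaching for a text mining package, a plain keyword lookup may already get you some of the way. The following is only a sketch in base R; the headlines, category names and keyword lists are all invented for illustration:

```r
# Hypothetical headlines and keyword lists; replace with your own.
headlines <- c(
  "Council approves new budget",
  "Storm closes coastal roads",
  "Striker signs for rival club"
)
keywords <- list(
  politics = c("council", "budget", "election"),
  weather  = c("storm", "flood", "rain"),
  sport    = c("striker", "club", "match")
)

# Return the first category whose keywords appear in the headline, else NA.
categorise <- function(headline, keywords) {
  for (category in names(keywords)) {
    pattern <- paste0("\\b(", paste(keywords[[category]], collapse = "|"), ")\\b")
    if (grepl(pattern, headline, ignore.case = TRUE, perl = TRUE)) {
      return(category)
    }
  }
  NA_character_
}

categories <- vapply(headlines, categorise, character(1),
                     keywords = keywords, USE.NAMES = FALSE)
result <- data.frame(headline = headlines, category = categories,
                     stringsAsFactors = FALSE)
```

A headline that matches several categories gets only the first one here, which is a deliberate simplification.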
Because I've not really done much of it before, I thought I'd have a go at this for you.
Getting JSON Headlines
There's a Kaggle Dataset containing a million headlines from ABC News available as a .csv file here: A Million News Headlines | Kaggle
The code below augments the data with dates and randomly sampled names for authors from the babynames package. This tibble is then converted to JSON with toJSON and exported for analysis.
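The original code for this step is not reproduced here, so what follows is only a rough sketch of the idea on toy data. The column names and the `name_pool` vector are assumptions; the pool stands in for names sampled from the babynames package:

```r
library(jsonlite)

set.seed(42)  # make the random author assignment reproducible

# Toy headlines; the real data would come from the Kaggle .csv file.
headlines <- data.frame(
  headline = c("Council approves new budget",
               "Storm closes coastal roads",
               "Striker signs for rival club"),
  stringsAsFactors = FALSE
)

# Hypothetical name pool standing in for names drawn from babynames.
name_pool <- c("Alice", "Bilal", "Chen", "Dana")

# Add a date and a randomly sampled author to every headline.
headlines$date <- as.character(
  as.Date("2017-01-01") + seq_len(nrow(headlines)) - 1
)
headlines$author <- sample(name_pool, nrow(headlines), replace = TRUE)

# Convert to a JSON array of records and write it to disk.
json_text <- toJSON(headlines, pretty = TRUE)
json_path <- tempfile(fileext = ".json")
writeLines(json_text, json_path)
```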
Our dataset is contained in a single .json file that's 98 MB in size, with 1.1 million headlines. The read_json function will take a very long time to import this file. Fortunately, someone else on Community asked about importing very large JSON files here:
This code will stream in our file and then convert it from JSON into a nice tibble we can use. Note that I've used the tictoc package so you can time how long this process takes.
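A sketch of the streaming approach, under the assumption that the file is stored as newline-delimited JSON (one record per line), which is the layout `jsonlite::stream_in()` expects. A file holding one big JSON array would need converting first. The toy data and column names are placeholders:

```r
library(jsonlite)
library(tibble)
library(tictoc)

# Write a small NDJSON file (one JSON object per line) to stand in for the
# real one; stream_in() expects this layout rather than one big JSON array.
toy <- data.frame(
  headline = c("Council approves new budget", "Storm closes coastal roads"),
  date     = c("2017-01-03", "2017-01-04"),
  stringsAsFactors = FALSE
)
ndjson_path <- tempfile(fileext = ".json")
stream_out(toy, file(ndjson_path), verbose = FALSE)

# Time the import: stream the records in, then convert to a tibble.
tic("stream in headlines")
headlines <- stream_in(file(ndjson_path), verbose = FALSE)
headlines <- as_tibble(headlines)
toc()
```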
As I said at the top, I've not done much of this before. I leaned fairly heavily on this article:
We're going to use the udpipe package to extract and tally NOUN-VERB pairs in our dataset to identify potential categories. We'll only consider headlines from 2017 as I don't want to lock up my computer for longer than a coffee break. So we can join this dataset with others later on, I'm also adding a unique ID for each headline:
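To make the idea concrete without running udpipe, here is a base R sketch on toy data. The column names are assumptions, and the `annotation` table is hand-built to mimic the shape of udpipe's output (`doc_id`, `lemma`, `upos`) rather than produced by the package:

```r
# Toy data; the column names are assumptions, not the original ones.
headlines <- data.frame(
  date = as.Date(c("2016-12-31", "2017-01-03", "2017-06-15", "2018-02-01")),
  headline = c("Old year ends", "Council approves budget",
               "Council approves plan", "New year begins"),
  stringsAsFactors = FALSE
)

# Keep only headlines from 2017 and give each one a unique ID.
headlines_2017 <- headlines[format(headlines$date, "%Y") == "2017", ]
headlines_2017$id <- sprintf("h%06d", seq_len(nrow(headlines_2017)))

# Hand-built stand-in for an annotation table of the two 2017 headlines.
annotation <- data.frame(
  doc_id = rep(headlines_2017$id, each = 3),
  lemma  = c("council", "approve", "budget", "council", "approve", "plan"),
  upos   = c("NOUN", "VERB", "NOUN", "NOUN", "VERB", "NOUN"),
  stringsAsFactors = FALSE
)

# Pair every noun with every verb inside the same headline, then tally.
nouns <- annotation[annotation$upos == "NOUN", c("doc_id", "lemma")]
verbs <- annotation[annotation$upos == "VERB", c("doc_id", "lemma")]
pairs <- merge(nouns, verbs, by = "doc_id", suffixes = c("_noun", "_verb"))
pair_counts <- as.data.frame(
  table(noun = pairs$lemma_noun, verb = pairs$lemma_verb),
  stringsAsFactors = FALSE
)
pair_counts <- pair_counts[order(-pair_counts$Freq), ]
```

Frequent pairs at the top of `pair_counts` are the candidates for categories; the ID column is what lets you join them back to the headlines.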
To do this we need to download the latest version of our model. Note that the filename changes after each update, so avoid hard-coding it.
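One way to avoid the hard-coded filename is to use the path that `udpipe_download_model()` reports in its return value. This is a sketch rather than the original code; the helper names `load_latest_model` and `annotate_headlines` are my own:

```r
# Download the current model for a language and load it from the path that
# udpipe reports, instead of typing a versioned file name by hand.
load_latest_model <- function(language = "english", model_dir = tempdir()) {
  info <- udpipe::udpipe_download_model(language = language,
                                        model_dir = model_dir)
  udpipe::udpipe_load_model(file = info$file_model)
}

# Annotate headlines and return a data frame of tokens, lemmas and POS tags.
# `text` is a character vector and `ids` the matching unique IDs.
annotate_headlines <- function(model, text, ids) {
  as.data.frame(udpipe::udpipe_annotate(model, x = text, doc_id = ids))
}
```

Calling `load_latest_model()` needs an internet connection and the udpipe package installed; the result can then be passed to `annotate_headlines()` together with the headline text and IDs.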
Sometimes a tool built for a specific format can come in handy.
For JSON, you could have a look at jq, a JSON processor.
It has an R wrapper available too.
For large JSON with nested content, it really helps to easily query what is inside. It could help you here.
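For instance, assuming the wrapper meant here is the jqr package (my assumption, not stated above), a jq filter can pull one nested field out of every record. The JSON and its field names are invented for illustration:

```r
library(jqr)

# Toy nested JSON; the structure is hypothetical.
json_text <- '[
  {"headline": "Council approves new budget", "meta": {"category": "politics"}},
  {"headline": "Storm closes coastal roads", "meta": {"category": "weather"}}
]'

# ".[]" iterates over the array; ".meta.category" reaches into each record.
categories <- jq(json_text, ".[] | .meta.category")
categories
```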