How to convert a large json file to csv in R

martinjhnhadley · March 8, 2019, 12:14pm

Because I've not really done much of it before, I thought I'd have a go at this for you.

Getting JSON Headlines

There's a Kaggle Dataset containing a million headlines from ABC News available as a .csv file here: A Million News Headlines | Kaggle

The code below augments the data with dates and randomly sampled names for authors from the babynames package. This tibble is then converted to JSON with toJSON and exported for analysis.

library("tidyverse")
library("lubridate")
library("babynames")
library("jsonlite")
library("here")

headlines_tib <- read_csv(here("abcnews-date-text.csv"))

headlines_tib <- headlines_tib %>%
  mutate(publish_date = ymd(publish_date)) %>%
  mutate(year = year(publish_date)) %>%
  mutate(authors = sample(babynames$name, nrow(.), replace = TRUE))

headlines_json <- toJSON(headlines_tib)

headlines_json %>% 
  write_json("abcnews-headlines.json")

Importing JSON

Our dataset is contained in a single .json file that's 98Mb in size with 1.1million headlines. The read_json function will take a very long time to import this file. Fortunately, someone else on Community asked about importing very large JSON files here:

This code will stream in our file and then convert it from JSON into a nice tibble we can use. Note that I've used the tictoc package so you can time how long this process takes.

library("tictoc")
tic()
abc_news_headlines <- stream_in(file(here("abcnews-headlines.json")))
abc_news_headlines <- fromJSON(as.character(abc_news_headlines$V1))
toc()

Tagging headlines with categories

As I said at the top, I've not done much of this before. I leaned fairly heavily on this article:

We're going to use the udpipe package to extract and tally NOUN-VERB pairs in our dataset to identify potential categories. We'll only consider headlines from 2017 as I don't want to lock up my computer for longer than a coffee break. So we can join this dataset with others later on, I'm also adding a unique ID for each headline:

abc_news_2017 <- abc_news_headlines %>%
  filter(year == 2017) %>%
  mutate(headline_id = row_number())

To do this we need to download the latest version of our model, note that the filename changes after each update so ensure not to hard code the filename

library("udpipe")

model <- udpipe_download_model(language = "english")
file_udpipe <- list.files(pattern = ".udpipe")
udmodel_english <- udpipe_load_model(file = file_udpipe)

Now we annotate our headlines with udpipe_annotate() and convert the output into a tibble

tic()
annotations_headlines <- udpipe_annotate(udmodel_english, abc_news_2017$headline_text)
toc()
# 272.85 sec elapsed
headlines_annotated <- annotations_headlines %>%
  data.frame() %>%
  as_tibble()

Now we can extract our NOUN-VERB phrases using keywords_phrases()

headlines_annotated$phrase_tag <-
  as_phrasemachine(headlines_annotated$upos, type = "upos")
noun_verb_stats <-
  keywords_phrases(
    x = headlines_annotated$phrase_tag,
    term = tolower(headlines_annotated$token),
    pattern = "(A|N)*N(P+D*(A|N)*N)*",
    is_regex = TRUE,
    detailed = FALSE
  ) %>%
  as_tibble()

Let's look at the most common NOUN-VERB phrases, they definitely look suitable as categories:

noun_verb_stats %>%
  filter(ngram > 1) %>%
  filter(freq > 3) %>%
  arrange(desc(freq)) %>%
  slice(1:10) %>%
  mutate(keyword = fct_reorder(keyword, freq)) %>%
  ggplot(aes(x = keyword, y = freq)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most common noun-verb phrases of least two words in ABC News Headlines")

Let's split our categories according to the number of words they contain:

two_word_topics <- noun_verb_stats %>%
  filter(ngram == 2) %>%
  filter(freq > 3) %>%
  select(keyword)

three_word_topics <- two_word_topics <- noun_verb_stats %>%
  filter(ngram == 3) %>%
  filter(freq > 3) %>%
  select(keyword)

The tidytext package now becomes really useful for us to detect our categories in the ngrams of our headlines:

tidy_two_word_topics <- abc_news_2017 %>%
  unnest_tokens(word, headline_text, token = "ngrams", n = 2) %>%
  filter(word %in% two_word_topics$keyword) %>%
  select(headline_id, word) %>%
  group_by(headline_id) %>%
  rename(two_word_topics = word) %>%
  mutate(two_word_topics = map(two_word_topics, ~list(.x))) %>%
  select(headline_id, two_word_topics) %>%
  mutate(n_two_word_topics = map_int(two_word_topics, ~length(.x))) %>%
  ungroup()

tidy_three_word_topics <- most_recent_headlines %>%
  unnest_tokens(word, headline_text, token = "ngrams", n = 3) %>%
  filter(word %in% three_word_topics$keyword) %>%
  select(headline_id, word) %>%
  group_by(headline_id) %>%
  rename(three_word_topics = word) %>%
  mutate(three_word_topics = map(three_word_topics, ~list(.x))) %>%
  select(headline_id, three_word_topics) %>%
  mutate(n_three_word_topics = map_int(three_word_topics, ~length(.x))) %>%
  ungroup()

tidy_topics <- tidy_two_word_topics %>%
  full_join(tidy_three_word_topics) %>%
  mutate(topics = pmap(list(two_word_topics, three_word_topics), ~c(.x, .y))) %>%
  mutate(n_topics = pmap_int(list(n_two_word_topics, n_three_word_topics), ~sum(.x, .y, na.rm = TRUE))) %>%
  select(headline_id, topics, n_topics)

Finally, we can rejoin this with our original headlines:

topiced_abc_news_2017 <- abc_news_2017 %>%
  full_join(tidy_topics) %>%
  filter(!is.na(n_topics)) %>%
  unnest(topics) %>%
  unnest(topics)

topiced_abc_news_2017 %>%
  head()
#> # A tibble: 6 x 7
#>   publish_date headline_text      year authors headline_id n_topics topics 
#>   <date>       <chr>             <dbl> <chr>         <dbl>    <dbl> <chr>  
#> 1 2017-01-01   ambulances ferry…  2017 Art               6        1 istanb…
#> 2 2017-01-01   manus island asy…  2017 Liesa            30        1 manus …
#> 3 2017-01-01   manus island asy…  2017 Liesa            30        1 island…
#> 4 2017-01-01   manus island asy…  2017 Ebb              31        1 manus …
#> 5 2017-01-01   manus island asy…  2017 Ebb              31        1 island…
#> 6 2017-01-01   north korea in f…  2017 Jeanet…          38        1 kim jo…