Filtering within Bigram results

Does anyone know how I can get word count results filtered per document? My current code shows me the total number of occurrences of a bigram but for the entire PDF corpus rather than per document.


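library(pdftools)  # pdf_text()
library(tm)        # Corpus(), tm_map(), stopwords()
library(dplyr)     # %>%, filter(), count()
library(tidyr)     # separate()
library(tidytext)  # unnest_tokens(), stop_words

# Read every PDF in the working directory
files = list.files(pattern = "pdf$")
all = lapply(files, pdf_text)
document = Corpus(VectorSource(all))

# Clean the text
document = tm_map(document, content_transformer(tolower))
document = tm_map(document, removeNumbers)
document = tm_map(document, removeWords, stopwords("english"))
document = tm_map(document, removePunctuation)

PDFDataframe = data.frame(text = sapply(document, as.character),
                          stringsAsFactors = FALSE)

# Split the text into bigrams, then into their two words
New_bigrams = PDFDataframe %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigrams_separated = New_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# Drop bigrams that contain a stop word
bigrams_filtered = bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# Counts TRUE/FALSE for word2 == "security" across the whole corpus, not per document
bigrams_filtered %>%
  filter(word1 == "information") %>%
  count(word2 == "security")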

See the FAQ "How to do a minimal reproducible example (reprex) for beginners". The structure of the PDFDataframe object is not shown, which makes it hard to help.

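If it has one column for the document id and another for the text, you can count by document in the same way as this example, which tokenizes the Jane Austen novels with n = 3 and counts one phrase per book: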

library(dplyr)
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>     filter, lag
#> The following objects are masked from 'package:base':
#>     intersect, setdiff, setequal, union
library(janeaustenr)
library(tidytext)

# n = 3 produces trigrams, so each row holds a three-word phrase
austen_trigrams <- austen_books() %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)

# Keep one phrase, then count its occurrences per book
austen_trigrams %>%
  filter(trigram == "ten thousand pounds") %>%
  count(book)
#> # A tibble: 3 × 2
#>   book                    n
#>   <fct>               <int>
#> 1 Sense & Sensibility     2
#> 2 Pride & Prejudice       4
#> 3 Persuasion              1
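
Applied to the PDF case, the same idea might look like the sketch below. It assumes the data frame has one row per document with an id column; the names `pdf_df`, `doc`, and the file names are made up for illustration, and a toy data frame stands in for the real PDF text so the example runs on its own. Because `unnest_tokens()` keeps the other columns, the document id survives tokenization and can be passed to `count()`.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the PDF text: one row per document (hypothetical file names).
# With real files, pdf_text() returns one string per page, so something like
#   text = sapply(files, function(f) paste(pdf_text(f), collapse = " "))
# would give one string per document.
pdf_df <- tibble(
  doc  = c("a.pdf", "b.pdf", "c.pdf"),
  text = c(
    "Information security matters. Information security is everyone's job.",
    "Our information security policy covers data security.",
    "No relevant phrase appears here."
  )
)

# Tokenize into bigrams (lowercased by default), keep the phrase of interest,
# and count its occurrences within each document
bigram_counts <- pdf_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(bigram == "information security") %>%
  count(doc)

bigram_counts
```

Documents in which the phrase never appears are simply absent from the result rather than listed with a count of zero.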