AdamCU
January 2, 2022, 5:25am
1
Does anyone know how I can get word count results filtered per document? My current code shows me the total number of occurrences of a bigram but for the entire PDF corpus rather than per document.
library(pdftools)
library(tm)
library(dplyr)
library(tidytext)
library(tidyr)
files = list.files(pattern = "pdf$")
files
all=lapply(files, pdf_text)
document= Corpus(VectorSource(all))
document= tm_map(document, content_transformer(tolower))
document= tm_map(document, removeNumbers)
document= tm_map(document, removeWords, stopwords("english"))
document= tm_map(document, removePunctuation)
PDFDataframe= data.frame(text = sapply(document, as.character),
                         stringsAsFactors = FALSE)
New_bigrams= PDFDataframe%>%
  unnest_tokens(bigram, text, token= "ngrams", n= 2)
bigrams_separated= New_bigrams%>%
  separate(bigram, c("word1", "word2"), sep= " ")
bigrams_filtered= bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigrams_filtered %>%
  filter(word1== "information") %>%
  count(word2== "security")
See the FAQ: How to do a minimal reproducible example (reprex) for beginners. The structure of the PDFDataframe object is not shown, which makes it hard to help.
If it has one variable for the document id and another for the text, the approach works similarly to this example:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(tidytext)
library(janeaustenr)
# n = 3 because the example phrase has three words (trigrams, not bigrams)
austen_trigrams <- austen_books() %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
# keep only the phrase of interest, then count its occurrences per book
austen_trigrams %>%
  filter(trigram == "ten thousand pounds") %>%
  count(book)
#> # A tibble: 3 × 2
#>   book                    n
#>   <fct>               <int>
#> 1 Sense & Sensibility     2
#> 2 Pride & Prejudice       4
#> 3 Persuasion              1
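Applied to the original question, a rough sketch might look like the following. It assumes the PDF text has first been collapsed to one row per file; the `pdf_df` tibble, its `document` column, and the file names are made up for illustration, since the real structure of the data was not shown. The key point is that the document id column has to survive tokenization so that `count()` can group by it.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the PDF corpus: one row per document.
# With real files, pdf_text() returns one string per page, so the pages
# would need to be pasted together per file to get this shape (assumption).
pdf_df <- tibble(
  document = c("a.pdf", "b.pdf", "c.pdf"),
  text = c(
    "Information security matters. Good information security is hard.",
    "This report covers information security policy.",
    "Nothing relevant here."
  )
)

# unnest_tokens() keeps the other columns, so each bigram stays
# attached to the document it came from.
bigram_counts <- pdf_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(bigram == "information security") %>%
  count(document)

bigram_counts
```

Documents that never contain the bigram drop out of the result entirely rather than appearing with a count of zero. Note also that `count(word2 == "security")` in the original code counts TRUE and FALSE values of that comparison instead of counting per document, which is why a grouping column is needed.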