Creating a co-occurrence network from a corpus of multiple PDFs

abooferas · December 11, 2021, 8:58am

Hello everyone!

I'm new to R and I'm working on text mining for multiple PDF files. So far I have managed to make a word cloud and a bar chart.
I need to make a co-occurrence network, that is, a plot showing the most frequent terms and the terms that appear together with them. I have done a lot of googling and spent days on this without getting any result.

Can anyone help me with some guidance or sample code?

Here is my complete code so far:

library(pdftools)     # reads PDF documents
library(tm)           # text mining analysis
library(wordcloud)    # word cloud plots
library(RColorBrewer) # colour palettes


files <- list.files(pattern = "pdf$") # create a vector of PDF file names (the PDF files are in the working directory)

alcohol <- lapply(files, pdf_text) # load all the files

length(alcohol) # check the number of files

lapply(alcohol, length) # check the number of pages in each file


pdfdatabase <- Corpus(URISource(files), readerControl = list(reader = readPDF)) # create a corpus from the PDF files
pdfdatabase <- tm_map(pdfdatabase, removeWords, stopwords("english"))
pdfdatabase <- tm_map(pdfdatabase, removeNumbers)
alcohol.tdm <- TermDocumentMatrix(pdfdatabase, control = list(removePunctuation = TRUE,
                                                              stopwords = TRUE,
                                                              tolower = TRUE,
                                                              stemming = FALSE, # was "streaming", which is not a tm control option
                                                              removeNumbers = TRUE,
                                                              bounds = list(global = c(3, Inf)))) # keep terms found in at least 3 documents



ft <- findFreqTerms(alcohol.tdm, lowfreq = 20, highfreq = Inf)

as.matrix(alcohol.tdm[ft,])

ft.tdm <- as.matrix(alcohol.tdm[ft,])
sort(apply(ft.tdm, 1, sum), decreasing = TRUE)



#find frequent terms
findFreqTerms(alcohol.tdm, lowfreq = 10)
#Examine frequent terms and their association
findAssocs(alcohol.tdm, terms = "sensor", corlimit = 0.5)




#convert term document matrix to data frame
m <- as.matrix(alcohol.tdm)
v <- sort(rowSums(m),decreasing = TRUE)
d <- data.frame(word = names(v), freq=v)


#create word cloud
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 10,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))



#Create bar chart
barplot(d[1:11,]$freq, las = 2, names.arg = d[1:11,]$word,
        col = "lightblue", main = "Most frequent words",
        ylab = "Word frequencies")
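
A possible starting point for the co-occurrence network described above is sketched below. It is a minimal sketch under several assumptions, not a definitive solution: it assumes the igraph package is installed (it is not used in the code above), it counts two terms as co-occurring when they appear in the same document, and it uses a small made-up matrix in place of as.matrix(alcohol.tdm[ft,]) so that it runs without any PDF files.

```r
library(igraph) # assumption: igraph is installed; it is not part of the original code

# Toy term-document matrix standing in for as.matrix(alcohol.tdm[ft, ])
# (rows = terms, columns = documents; the counts are made up for illustration)
m <- matrix(c(5, 2, 0,
              3, 0, 1,
              0, 4, 2,
              1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("sensor", "alcohol", "breath", "detect"),
                            c("doc1", "doc2", "doc3")))

# Binarise: 1 if the term occurs in the document at all, 0 otherwise
present <- (m > 0) * 1

# Term-by-term co-occurrence: number of documents containing both terms
cooc <- present %*% t(present)
diag(cooc) <- 0 # a term co-occurring with itself is not informative

# Undirected graph whose edge weights are the co-occurrence counts
g <- graph_from_adjacency_matrix(cooc, mode = "undirected", weighted = TRUE)

# Vertex size follows total term frequency, edge width follows co-occurrence
V(g)$size  <- 10 + 2 * unname(rowSums(m)[V(g)$name])
E(g)$width <- E(g)$weight

set.seed(1234) # the layout is random, so fix the seed for a reproducible plot
plot(g, layout = layout_with_fr(g),
     vertex.color = "lightblue", vertex.label.color = "black")
```

To try it on the real data, m could be replaced with ft.tdm from the code above. With only a few PDFs, document-level counts tend to connect almost every pair of frequent terms, so raising lowfreq in findFreqTerms or dropping low-weight edges may be needed to keep the plot readable.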
