Problems with findAssocs and creating an association network

DLU · January 18, 2020, 1:33pm

Dear all,
I'm very new to RStudio, but I'm trying to perform text analysis on a simple Excel file. I created a word cloud following the instructions on these websites: "RPubs - Text mining R-cran Descriptions - Part A" and "Text mining and word cloud fundamentals in R : 5 simple steps you should know - Easy Guides - Wiki - STHDA".
I got the word cloud working, and I would like to pick out the top 5 words and analyse which words are most frequently associated with them. I would like to visualise that with an association network for each of the 5 words. I started with the function findAssocs, but then I get every existing word as outcome with a correlation of 1. After reading about this topic I came to the conclusion that, if I'm right, the problem is that my 1015-line Excel file gets turned into 1 document instead of 1015. I already tried the following solution that was offered to someone else, but that doesn't seem to work either.
corp <- Corpus(DataframeSource(Kans))

dtm <- DocumentTermMatrix(corp)
dtm
<<DocumentTermMatrix (documents: 1, terms: 4028)>>
Non-/sparse entries: 4028/0
Sparsity : 0%
Error in nchar(Terms(x), type = "chars") :
invalid multibyte string, element 270
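
An aside on the "invalid multibyte string" error: it usually means some text is not valid in the encoding R assumes. Below is a minimal base-R sketch of how one might detect and repair that. The Latin-1 source encoding is purely an assumption for illustration; the thread never states how the Excel export was encoded.

```r
# Hypothetical input: the second element holds a Latin-1 byte (\xe9),
# which is not valid UTF-8 on its own.
x <- c("werkbon", "caf\xe9")

# validUTF8() flags elements that cannot be read as UTF-8.
bad <- !validUTF8(x)

# Re-encode, assuming the bytes really are Latin-1.
fixed <- iconv(x, from = "latin1", to = "UTF-8")

# After conversion every element is valid and nchar() can count characters.
validUTF8(fixed)
nchar(fixed, type = "chars")
```

If the real file uses a different encoding, the `from` argument would need to change accordingly.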

I also changed my column headers to doc_id and text, as I read in another case, but that just turns it into 2 documents, so I just can't seem to get it working. I would very much appreciate it if someone could help me out. My file looks like this; I only included 5 lines that contain the word 'werkbon':

doc_id	text
42	Monteur kon niet klokken op overige afdeling. Hierdoor heeft hij zelf een werkbon aangemaakt en moeten de uren aangepast worden
56	1) Service technici heeft niet gebeld dat er 1 cilinder op lage druk staat. 2) De werkbon is niet juist controleert. 3) Actie is aangemaakt maar niks mee gedaan
62	start/stop klokken e.a. correcties doorgevoerd in tijdregistratie en werkbon
69	start/stop klokken verwijderd op werkbon en in de tijdregitratie
78	Niet in de werkbon gezet welk paneel vervangen moet worden. Tevens intern gemaild dat paneel teruggestuurd moet worden naar Schrack zodat wij het op garantie kunnen gooien. Dit staat nergens vermeld en de service technicus weet van niks. Na telefonisch overleg met Marinus Jan afgesproken dat hij het paneel meeneemt naar .
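
For readers wondering what the doc_id/text layout is for: as far as I understand the tm documentation, DataframeSource treats each row as one document when the first column is named doc_id and the second text. The sketch below uses a small hand-typed stand-in for the Excel data (three of the rows above); using a plain data.frame with character IDs is my own precaution, not something the thread confirms is required.

```r
library(tm)

# Hypothetical stand-in for the imported Excel sheet: one row per comment.
Kans <- data.frame(
  doc_id = c("42", "62", "69"),
  text = c(
    "Monteur kon niet klokken op overige afdeling. Hierdoor heeft hij zelf een werkbon aangemaakt en moeten de uren aangepast worden",
    "start/stop klokken e.a. correcties doorgevoerd in tijdregistratie en werkbon",
    "start/stop klokken verwijderd op werkbon en in de tijdregitratie"
  ),
  stringsAsFactors = FALSE
)

# Each row of the data frame becomes one document in the corpus.
corp <- VCorpus(DataframeSource(Kans))
dtm <- DocumentTermMatrix(corp)

# The number of documents should match the number of rows.
nDocs(dtm)
```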

The original code I put in was:
install.packages("tm")
install.packages("SnowballC")
install.packages("wordcloud")
install.packages("RColorBrewer")
install.packages("corpus")
install.packages("gdata")
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library("gdata")
library(corpus)
docs <- Corpus(VectorSource(Kans))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("dutch"))
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
findAssocs(dtm, terms = "werkbon", corlimit = 0.3)

Many thanks in advance.

With kind regards,

Diana

technocrat · January 18, 2020, 9:15pm

Hi, and welcome!

Thanks for including your code. It would be more helpful if it were in the form of a reproducible example, called a reprex, especially because it's hard to know where to look for the Kans corpus.

Also, it's better not to use install.packages. A better alternative is to leave the choice to the tester with

require(tm)
require(SnowballC)
require(wordcloud)
require(RColorBrewer)
require(corpus)
require(gdata)
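
One way to leave the install decision to whoever runs the code is to report which packages are missing rather than installing them. This is only a sketch; the package list is copied from the thread, and the variable names are my own.

```r
# Packages the script relies on (taken from the original post).
pkgs <- c("tm", "SnowballC", "wordcloud", "RColorBrewer")

# requireNamespace() returns TRUE/FALSE without attaching or installing anything.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Tell the reader what is missing and let them decide whether to install.
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```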

I'll look for a suitable corpus and see if I can find a way to get the object you need.

technocrat · January 19, 2020, 3:52am

This exercise suggests that the problem is not with a single document

require(tm)
#> Loading required package: tm
#> Loading required package: NLP
require(wordcloud)
#> Loading required package: wordcloud
#> Loading required package: RColorBrewer
require (corpus)
#> Loading required package: corpus
data(crude)
set.seed(1234)
# start with minimal example
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, function(x)removeWords(x,stopwords()))
wordcloud(crude)

tdm <- TermDocumentMatrix(crude)
findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1))
#> $oil
#>       158     named   clearly      late    prices    trying    winter   markets 
#>      0.87      0.81      0.79      0.79      0.79      0.79      0.79      0.78 
#>      said  analysts agreement emergency    buyers     fixed      they 
#>      0.78      0.77      0.76      0.74      0.71      0.71      0.71 
#> 
#> $opec
#>              158         analysts           buyers        emergency 
#>             0.88             0.87             0.87             0.87 
#>             they          meeting       production            named 
#>             0.84             0.83             0.82             0.81 
#>             said           demand          address        addressed 
#>             0.80             0.79             0.78             0.78 
#>        advantage        agreement         although         analysis 
#>             0.78             0.78             0.78             0.78 
#>          analyst         anything       associates            bijan 
#>             0.78             0.78             0.78             0.78 
#>         brothers        cambridge           center             cera 
#>             0.78             0.78             0.78             0.78 
#>    characterized         cheating           closer        condition 
#>             0.78             0.78             0.78             0.78 
#>          control         critical          cutting           daniel 
#>             0.78             0.78             0.78             0.78 
#>            david           deemed          dillard         director 
#>             0.78             0.78             0.78             0.78 
#>          earlier             easy           editor            eight 
#>             0.78             0.78             0.78             0.78 
#>      environment           excess         excesses          expects 
#>             0.78             0.78             0.78             0.78 
#>            faces          harvard      immediately       initiative 
#>             0.78             0.78             0.78             0.78 
#>            issue             june             keep            learn 
#>             0.78             0.78             0.78             0.78 
#>           lesson              ltd          manager          mideast 
#>             0.78             0.78             0.78             0.78 
#>          mizrahi           mlotok moussavarrahmani         movement 
#>             0.78             0.78             0.78             0.78 
#>             need         optimism       optimistic     organization 
#>             0.78             0.78             0.78             0.78 
#>             paul      pessimistic        principal          problem 
#>             0.78             0.78             0.78             0.78 
#>         problems         prompted          quarter           quotas 
#>             0.78             0.78             0.78             0.78 
#>        readdress           regain         regional        reiterate 
#>             0.78             0.78             0.78             0.78 
#>          reuters           rising          salomon        scheduled 
#>             0.78             0.78             0.78             0.78 
#>           seeing          session         slackens            slide 
#>             0.78             0.78             0.78             0.78 
#>             soon             sort            spoke          spriggs 
#>             0.78             0.78             0.78             0.78 
#>           supply            teach        telephone          thought 
#>             0.78             0.78             0.78             0.78 
#>         together             told              try        uncertain 
#>             0.78             0.78             0.78             0.78 
#>      universitys         unlikely            wants           wishes 
#>             0.78             0.78             0.78             0.78 
#>           yergin          clearly    differentials          however 
#>             0.78             0.77             0.77             0.77 
#>             late          reports           trying           winter 
#>             0.77             0.77             0.77             0.77 
#> 
#> $xyz
#> numeric(0)
# create single document and repeat
refined <- crude[2]
wordcloud(refined)

tdm <- TermDocumentMatrix(refined)
# reflecting shorter document
findAssocs(tdm, c("oil", "opec", "problem"), c(0.7, 0.75, 0.1))
#> $oil
#> numeric(0)
#> 
#> $opec
#> numeric(0)
#> 
#> $problem
#> numeric(0)
# no terms meet the criteria
# ∴ problem is not a single document

Created on 2020-01-18 by the reprex package (v0.3.0)
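
It may help to look at the arithmetic behind these results. Assuming findAssocs reports ordinary (Pearson) correlations between term counts across documents, which is my reading of the tm documentation, the number of documents matters a great deal: with one document a correlation cannot be computed, and with exactly two documents any two non-constant terms correlate at +1 or -1. The counts below are made up for illustration.

```r
# Rows are terms, columns are documents (invented counts).
two_docs <- matrix(
  c(2, 0,
    1, 3,
    0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("oil", "opec", "prices"), c("doc1", "doc2"))
)

# cor() correlates columns, so transpose to correlate terms.
# With only two documents every pair of terms lies on a straight line,
# so each correlation is exactly +1 or -1.
cor_two <- cor(t(two_docs))
cor_two

# With a single document there is no variation to correlate;
# I would expect NA values here rather than numbers.
one_doc <- two_docs[, "doc1", drop = FALSE]
cor_one <- suppressWarnings(cor(t(one_doc)))
cor_one
```

So a corpus that collapses to one or two documents can produce either nothing or a wall of 1s, which is worth keeping in mind when reading the outputs later in this thread.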

DLU · January 19, 2020, 6:25pm

Thank you so much for being willing to help, technocrat. I will do my best to be more helpful with reproducible examples next time. I tried the code you mentioned with a test file of only 4 lines. I imported the Excel file, but when I run data(Test) it gives the following warning:

data('Test')
Warning message:
In data("Test") : data set ‘Test’ not found

I find that really strange, because when I look at the data by typing Test I do see those 4 lines. I'm really sorry if this is a stupid question, but even searching online for this warning didn't give me much of an answer. I also tried giving the file path, but I still get the same warning. And when I try to create a word cloud with wordcloud(Test), I get the following error:

Error in UseMethod("TermDocumentMatrix", x) :
no applicable method for 'TermDocumentMatrix' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"

But in your test case you don't create a tdm before creating the word cloud either. I also installed the reprex package, but when running reprex I only get this link. Is that exactly what you need?
Created on 2020-01-19 by the reprex package (v0.3.0)

Many thanks in advance again!

With kind regards,

Diana

technocrat · January 19, 2020, 10:45pm

@DLU, there are no stupid questions here!

data() is a function that loads one of the built-in data sets in the packages you have loaded. For example

suppressPackageStartupMessages(library(dplyr))
data(nasa)
nasa
#> Source: local array [41,472 x 4]
#> D: lat [dbl, 24]
#> D: long [dbl, 24]
#> D: month [int, 12]
#> D: year [int, 6]
#> M: cloudhigh [dbl[,24,12,6]]
#> M: cloudlow [dbl[,24,12,6]]
#> M: cloudmid [dbl[,24,12,6]]
#> M: ozone [dbl[,24,12,6]]
#> M: pressure [dbl[,24,12,6]]
#> M: surftemp [dbl[,24,12,6]]
#> M: temperature [dbl[,24,12,6]]

Created on 2020-01-19 by the reprex package (v0.3.0)

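So, yes, Test is in your workspace (the global environment). It is not a data set shipped with a package, which is why data() can't find it.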
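
The difference can be checked directly. In this sketch a small hypothetical Test object stands in for the imported file:

```r
# An object created or imported in the session lives in the workspace.
Test <- data.frame(doc_id = 1:2, text = c("a", "b"), stringsAsFactors = FALSE)
exists("Test")

# data() only knows about data sets that ship with packages.
# Here we list the items of the built-in 'datasets' package and look for "Test".
items <- data(package = "datasets")$results[, "Item"]
in_datasets <- "Test" %in% items
in_datasets
```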

On to the second question, and this one will help with understanding how to read the help() pages. Think of R as school algebra writ large, f(x) = y.

Here f is wordcloud and x is Test.

Here's the function signature and the portion that describes the one thing it must have

wordcloud(words,freq,scale=c(4,.5),min.freq=3,max.words=Inf,
random.order=TRUE, random.color=FALSE, rot.per=.1,
colors="black",ordered.colors=FALSE,use.r.layout=FALSE,
fixed.asp=TRUE, ...)
Arguments

words
the words

It could be clearer. Test has words, but not in the form that wordcloud expects. Scrolling down to the examples, it shows that words can be either a character string or a VCorpus object.

library(tm)
#> Loading required package: NLP
library(wordcloud)
#> Loading required package: RColorBrewer
my_words <- "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
class(my_words)
#> [1] "character"
data(crude)
class(crude)
#> [1] "VCorpus" "Corpus"

Created on 2020-01-19 by the reprex package (v0.3.0)
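
To get from a data frame to something wordcloud accepts, one option is to pull out the text column and count words yourself. This is a base-R sketch with two of the example lines typed in by hand; the simple punctuation stripping and whitespace splitting are my own simplifications, not what tm does internally.

```r
# Hypothetical text column, as a plain character vector.
text <- c(
  "Geen activiteit aangemaakt op werkbon. Alsnog gedaan",
  "foute tekst op werkbon laten staan."
)

# Lower-case, strip punctuation, split on whitespace.
tokens <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+"))

# Count each word and build the words/freq pair wordcloud() asks for.
freq <- sort(table(tokens), decreasing = TRUE)
d <- data.frame(word = names(freq), freq = as.integer(freq),
                stringsAsFactors = FALSE)
head(d)

# With the wordcloud package installed, the call would then be:
# wordcloud::wordcloud(words = d$word, freq = d$freq, min.freq = 1)
```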

Finally, reprex. The FAQ topic includes an overview and links to how-tos:

FAQ: What's a reproducible example (`reprex`) and how do I create one?

Why reprex? Getting unstuck is hard. Your first step here is usually to create a reprex, or reproducible example. The goal of a reprex is to package your code, and information about your problem so that others can run it and feel your pain. Then, hopefully, folks can more easily provide a solution.

What's in a Reproducible Example? Parts of a reproducible example:
background information - Describe what you are trying to do. What have you already done?
complete set up - include any library() calls and data to reproduce your issue.
data for a reprex: Here's a discussion on setting up data for a reprex
make it run - include the minimal code required to reproduce your error on the data…

DLU · January 20, 2020, 6:24pm

Dear technocrat,

Thank you for explaining the wordcloud function. I tried the data function again, adding a column with doc_id as before, with both an Excel file and a text file, but I still get the same warning. Did I forget a step to create a data set? I also tried to generate the reprex, but I'm getting an error with that too:
Error: <callr_status_error: callr subprocess failed: :19:9: unexpected symbol
18: require(tm)
19: Loading required

Therefore I included a paste of the code, though probably not in a form that is much use to you; I'm really sorry. I also tried datapasta, but that didn't seem to work either.

install.packages("reprex")
Installing package into ‘C:/Users/dluij/Documents/R/win-library/3.6’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/reprex_0.3.0.zip'
Content type 'application/zip' length 430477 bytes (420 KB)
downloaded 420 KB

package ‘reprex’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\dluij\AppData\Local\Temp\RtmpmW22y3\downloaded_packages

library(reprex)
require(tm)
Loading required package: tm
Loading required package: NLP
require(wordcloud)
Loading required package: wordcloud
Loading required package: RColorBrewer
require (corpus)
Loading required package: corpus
library(readxl)
Test <- read_excel("Diana studie/Big data/Rstudio/Test.xlsx")
View(Test)
Test

A tibble: 4 x 2

doc_id Text

1 1 worden opdrachten verstrekt aan 3e partijen maar er wordt geen werkbon aangemaak…
2 2 Geen activiteit aangemaakt op werkbon. Alsnog gedaan
3 3 foute tekst op werkbon laten staan.
4 4 Geen activiteit aangemaakt op werkbon. Alsnog gedaan

data(Test)
Warning message:
In data(Test) : data set ‘Test’ not found
library(readr)
Test <- read_csv("Diana studie/Big data/Rstudio/Test.txt")
Parsed with column specification:
cols(
doc_id Text = col_character()
)
View(Test)
Test

A tibble: 4 x 1

doc_id\tText

1 "1\tworden opdrachten verstrekt aan 3e partijen maar er wordt geen werkbon aangemaakt."
2 "2\tGeen activiteit aangemaakt op werkbon. Alsnog gedaan"
3 "3\tfoute tekst op werkbon laten staan."
4 "4\tGeen activiteit aangemaakt op werkbon. Alsnog gedaan"

data(Test)
Warning message:
In data(Test) : data set ‘Test’ not found
data('Test')
Warning message:
In data("Test") : data set ‘Test’ not found

With kind regards,

Diana
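
A note on the second import above: the column header shows up as "doc_id\tText", which suggests the text file is tab-separated while read_csv expects commas. A base-R sketch of reading such data with a tab delimiter follows; the inline string is a hand-typed stand-in for the file.

```r
# Stand-in for the contents of Test.txt: fields separated by tabs.
raw <- paste(
  "doc_id\tText",
  "1\tworden opdrachten verstrekt aan 3e partijen maar er wordt geen werkbon aangemaakt.",
  "2\tGeen activiteit aangemaakt op werkbon. Alsnog gedaan",
  "3\tfoute tekst op werkbon laten staan.",
  "4\tGeen activiteit aangemaakt op werkbon. Alsnog gedaan",
  sep = "\n"
)

# read.delim() splits on tabs by default, giving two proper columns.
Test <- read.delim(text = raw, stringsAsFactors = FALSE)
str(Test)
```

With a real file, the same call would take the file path instead of `text = raw`.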

technocrat · January 20, 2020, 6:51pm

Don't use

data(Test)
#OR
data("Test")

For any function that needs Test, pass it directly:

some_function(Test)

DLU · January 21, 2020, 6:32pm

Dear Technocrat,

I'm sorry, I'm really new to R, but I guess data just means I should import my data there. I got the function working now, but I don't like the outcome, and I still think that's because the 4 lines are turned into 1 text, so the correlation will always be 1. I'll try to explain how I see it. In my example the word 'gedaan' appears in 2 of the 4 lines, so it is the word most associated with 'werkbon'; I would therefore expect a higher correlation with 'gedaan' than with, for example, the word 'tekst', which only appears in 1 line. I hope you get what I mean. This was the full code I entered. I again tried with reprex, but I'm not sure it worked, because I still get an error.

require(wordcloud)
Loading required package: wordcloud
Loading required package: RColorBrewer
require (corpus)
Loading required package: corpus
library(readxl)
Test <- read_excel("Test.xlsx")
View(Test)
Test

A tibble: 4 x 2

doc_id Text

1 1 worden opdrachten verstrekt aan 3e partijen maar er wordt geen werkbon aangemaak…
2 2 Geen activiteit aangemaakt op werkbon. Alsnog gedaan
3 3 foute tekst op werkbon laten staan.
4 4 Geen activiteit aangemaakt op werkbon. Alsnog gedaan

set.seed(1234)
docs <- Corpus(VectorSource(Test))
docs <- tm_map(docs, content_transformer(tolower))
Warning message:
In tm_map.SimpleCorpus(docs, content_transformer(tolower)) :
transformation drops documents
docs <- tm_map(docs, removePunctuation)
Warning message:
In tm_map.SimpleCorpus(docs, removePunctuation) :
transformation drops documents
docs <- tm_map(docs, removeWords, stopwords("dutch"))
Warning message:
In tm_map.SimpleCorpus(docs, removeWords, stopwords("dutch")) :
transformation drops documents
tdm <- TermDocumentMatrix(docs)
findAssocs(tdm, c("werkbon"), c(0.7))
$werkbon
aangemaakt activiteit     alsnog    cworden      foute     gedaan      laten opdrachten
         1          1          1          1          1          1          1          1
  partijen      staan      tekst  verstrekt
         1          1          1          1

Many thanks again.

With kind regards,

Diana
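
The expectation about 'gedaan' versus 'tekst' can be checked by hand with per-line counts. This base-R sketch uses the four test lines (with the first line written out in full, as it appears in the text-file import earlier in the thread) and simplified tokenising of my own. One thing it brings out: 'werkbon' occurs exactly once in every line, so its count never varies and a correlation with it is undefined in this tiny sample, even when each line is its own document.

```r
lines <- c(
  "worden opdrachten verstrekt aan 3e partijen maar er wordt geen werkbon aangemaakt.",
  "Geen activiteit aangemaakt op werkbon. Alsnog gedaan",
  "foute tekst op werkbon laten staan.",
  "Geen activiteit aangemaakt op werkbon. Alsnog gedaan"
)

# One token vector per line: lower-case, no punctuation, split on whitespace.
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", lines)), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Term-by-line count matrix: rows are words, columns are lines.
counts <- sapply(tokens, function(tk) table(factor(tk, levels = vocab)))
colnames(counts) <- paste0("line", seq_along(lines))

counts["werkbon", ]   # the same count in every line
counts["gedaan", ]    # present in lines 2 and 4 only

# A constant vector has zero standard deviation, so cor() gives NA (with a warning).
r_werkbon <- suppressWarnings(cor(counts["werkbon", ], counts["gedaan", ]))

# Two words that always appear together correlate perfectly.
r_pair <- cor(counts["gedaan", ], counts["alsnog", ])
```

On a larger file where 'werkbon' is absent from many lines, the correlations become meaningful.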

technocrat · January 21, 2020, 8:57pm

Please don't be discouraged. Getting started in R can be slow, especially when taking it up for moderately difficult problems like text mining.

Here's a reprex for where you stand:

require(corpus)
#> Loading required package: corpus
require(dplyr)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
require(ggplot2)
#> Loading required package: ggplot2
require(tidytext)
#> Loading required package: tidytext
require(tm)
#> Loading required package: tm
#> Loading required package: NLP
#> 
#> Attaching package: 'NLP'
#> The following object is masked from 'package:ggplot2':
#> 
#>     annotate
require(wordcloud)
#> Loading required package: wordcloud
#> Loading required package: RColorBrewer
Test <- structure(list(doc_id = c(1, 2, 3, 4), text = c("worden opdrachten verstrekt aan 3e partijen maar er wordt geen werkbon aangemaak…", 
"Geen activiteit aangemaakt op werkbon. Alsnog gedaan", "foute tekst op werkbon laten staan.", 
"Geen activiteit aangemaakt op werkbon. Alsnog gedaan")), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), spec = structure(list(
    cols = list(doc_id = structure(list(), class = c("collector_double", 
    "collector")), text = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))
Test
#> # A tibble: 4 x 2
#>   doc_id text                                                                   
#>    <dbl> <chr>                                                                  
#> 1      1 worden opdrachten verstrekt aan 3e partijen maar er wordt geen werkbon…
#> 2      2 Geen activiteit aangemaakt op werkbon. Alsnog gedaan                   
#> 3      3 foute tekst op werkbon laten staan.                                    
#> 4      4 Geen activiteit aangemaakt op werkbon. Alsnog gedaan

set.seed(1234)
docs <- Corpus(VectorSource(Test))
docs <- tm_map(docs, content_transformer(tolower))
#> Warning in tm_map.SimpleCorpus(docs, content_transformer(tolower)):
#> transformation drops documents
docs <- tm_map(docs, removePunctuation)
#> Warning in tm_map.SimpleCorpus(docs, removePunctuation): transformation drops
#> documents
docs <- tm_map(docs, removeWords, stopwords(stopwords_nl))
#> Warning in if (is.na(resolved)) kind else if (identical(resolved, "porter"))
#> "english" else resolved: the condition has length > 1 and only the first element
#> will be used
#> Error in stopwords(stopwords_nl): no stopwords available for 'aan'no stopwords available for 'al'no stopwords available for 'alles'no stopwords available for 'als'no stopwords available for 'altijd'no stopwords available for 'andere'no stopwords available for 'ben'no stopwords available for 'bij'no stopwords available for 'daar'no stopwords available for 'dan'no stopwords available for 'dat'no stopwords available for 'de'no stopwords available for 'der'no stopwords available for 'deze'no stopwords available for 'die'no stopwords available for 'dit'no stopwords available for 'doch'no stopwords available for 'doen'no stopwords available for 'door'no stopwords available for 'dus'no stopwords available for 'een'no stopwords available for 'eens'no stopwords available for 'en'no stopwords available for 'er'no stopwords available for 'ge'no stopwords available for 'geen'no stopwords available for 'geweest'no stopwords available for 'haar'no stopwords available for 'had'no stopwords available for 'heb'no stopwords available for 'hebben'no stopwords available for 'heeft'no stopwords available for 'hem'no stopwords available for 'het'no stopwords available for 'hier'no stopwords available for 'hij'no stopwords available for 'hoe'no stopwords available for 'hun'no stopwords available for 'iemand'no stopwords available for 'iets'no stopwords available for 'ik'no stopwords available for 'in'no stopwords available for 'is'no stopwords available for 'ja'no stopwords available for 'je'no stopwords available for 'kan'no stopwords available for 'kon'no stopwords available for 'kunnen'no stopwords available for 'maar'no stopwords available for 'me'no stopwords available for 'meer'no stopwords available for 'men'no stopwords available for 'met'no stopwords available for 'mij'no stopwords available for 'mijn'no stopwords available for 'moet'no stopwords available for 'na'no stopwords available for 'naar'no stopwords available for 'niet'no stopwords available for 'niets'no stopwords 
available for 'nog'no stopwords available for 'nu'no stopwords available for 'of'no stopwords available for 'om'no stopwords available for 'omdat'no stopwords available for 'onder'no stopwords available for 'ons'no stopwords available for 'ook'no stopwords available for 'op'no stopwords available for 'over'no stopwords available for 'reeds'no stopwords available for 'te'no stopwords available for 'tegen'no stopwords available for 'toch'no stopwords available for 'toen'no stopwords available for 'tot'no stopwords available for 'u'no stopwords available for 'uit'no stopwords available for 'uw'no stopwords available for 'van'no stopwords available for 'veel'no stopwords available for 'voor'no stopwords available for 'want'no stopwords available for 'waren'no stopwords available for 'was'no stopwords available for 'wat'no stopwords available for 'werd'no stopwords available for 'wezen'no stopwords available for 'wie'no stopwords available for 'wil'no stopwords available for 'worden'no stopwords available for 'wordt'no stopwords available for 'zal'no stopwords available for 'ze'no stopwords available for 'zelf'no stopwords available for 'zich'no stopwords available for 'zij'no stopwords available for 'zijn'no stopwords available for 'zo'no stopwords available for 'zonder'no stopwords available for 'zou'



dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

wordcloud(d$word,d$freq,c(6,.3),1)
#> Warning in wordcloud(d$word, d$freq, c(6, 0.3), 1): werkbon could not be fit on
#> page. It will not be plotted.


findAssocs(tdm, c("werkbon"), c(0.01))
#> $werkbon
#>        aan aangemaakt aangemaak… activiteit     alsnog    cworden      foute 
#>          1          1          1          1          1          1          1 
#>     gedaan       geen      laten       maar opdrachten   partijen      staan 
#>          1          1          1          1          1          1          1 
#>      tekst  verstrekt      wordt 
#>          1          1          1

# werkbon is associated with all the words in the term-document matrix

Created on 2020-01-21 by the reprex package (v0.3.0)
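
One thing worth checking in the reprex above is what VectorSource receives. A data frame is a list of columns, so its length is the number of columns, not rows; the term 'cworden' in the output looks consistent with a whole column having been deparsed as c("worden ...") into a single document. The sketch below passes only the text column instead. It is an illustration under that assumption, not a confirmed diagnosis, and it uses stopwords("dutch") as in the original post.

```r
library(tm)

Test <- data.frame(
  doc_id = 1:4,
  text = c(
    "worden opdrachten verstrekt aan 3e partijen maar er wordt geen werkbon aangemaakt.",
    "Geen activiteit aangemaakt op werkbon. Alsnog gedaan",
    "foute tekst op werkbon laten staan.",
    "Geen activiteit aangemaakt op werkbon. Alsnog gedaan"
  ),
  stringsAsFactors = FALSE
)

# A data frame has one element per column; the text column has one per row.
length(Test)
length(Test$text)

# Build the corpus from the character vector so each line is a document.
docs <- VCorpus(VectorSource(Test$text))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("dutch"))

tdm <- TermDocumentMatrix(docs)
nDocs(tdm)
```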

Give me a couple of days to find a larger Dutch corpus to see if that's more in line with your expectations.

DLU · January 22, 2020, 1:19pm

@technocrat
That would be great, and I really appreciate your help, but I'm still leaning towards the idea that it has something to do with my first suggestion: by creating the Corpus, all the lines get merged into 1 text bundle, or 1 document. Please see this article/question: https://www.researchgate.net/post/findAssocs_in_TM_package_in_R_help.
I think that's why a correlation of 1 is produced. I'm not really sure how a larger Dutch corpus could give us more insight. Could you clarify what you are trying to achieve with that? I don't mean this as criticism; I just really want to learn more about this. Also, if you don't have the time for that I completely understand, so don't feel obligated.

Gr. Diana

DLU · January 25, 2020, 2:58pm

@technocrat

It seems I got the file working the way I wanted, and I'm getting some output from the findAssocs function. But now I'm confronted with the next problem: in some of the lines the search word is mentioned 2 or 3 times. As a result, every other word in such a line shows a high correlation, when in fact that shouldn't be the case. Do you know of any function that removes words that are mentioned more than once within the same line?
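
One possible approach to the repeated-word question is to record only whether a word occurs in a line, not how often. The base-R sketch below does this on an invented count matrix; the counts and names are illustrative only.

```r
# Invented term-by-line counts; 'werkbon' appears 3 times in line1.
counts <- matrix(
  c(3, 1, 0,
    1, 1, 0,
    0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("werkbon", "klokken", "paneel"), paste0("line", 1:3))
)

# Replace every positive count by 1: presence/absence per line.
binary <- (counts > 0) * 1
binary

# Correlations on the binary matrix ignore within-line repetition.
cor(t(binary))["werkbon", ]
```

The tm package also provides a weightBin weighting that can be supplied when building the matrix, for example TermDocumentMatrix(corpus, control = list(weighting = weightBin)). Whether that gives the result wanted here would need checking on the real data.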

I've added the code I used this time so you can see what I did differently. I'm not sure which changes made it work, but I hope you can tell me so I can better understand what I actually did.

setwd("C:/Users/dluij/Documents/Diana studie/Big data/Rstudio/Nieuw")
require(tm)
Loading required package: tm
Loading required package: NLP
require(wordcloud)
Loading required package: wordcloud
Loading required package: RColorBrewer
require(ggplot2)
Loading required package: ggplot2

Attaching package: ‘ggplot2’

The following object is masked from ‘package:NLP’:

annotate

require(corpus)
Loading required package: corpus
Kans<- read.csv("Test.txt", header = TRUE, stringsAsFactors = FALSE)
review_text<- paste(Kans$text)

review_sourc<- VectorSource(review_text)
corpus<- Corpus(review_sourc)
corpus<- tm_map(corpus,content_transformer(tolower))
Warning message:
In tm_map.SimpleCorpus(corpus, content_transformer(tolower)) :
transformation drops documents
corpus<- tm_map(corpus,removeNumbers)
Warning message:
In tm_map.SimpleCorpus(corpus, removeNumbers) :
transformation drops documents
corpus<- tm_map(corpus, removePunctuation)
Warning message:
In tm_map.SimpleCorpus(corpus, removePunctuation) :
transformation drops documents
corpus<- tm_map(corpus,stripWhitespace)
Warning message:
In tm_map.SimpleCorpus(corpus, stripWhitespace) :
transformation drops documents
corpus <- tm_map(corpus, removeWords, stopwords("dutch"))
Warning message:
In tm_map.SimpleCorpus(corpus, removeWords, stopwords("dutch")) :
transformation drops documents

tdm <- TermDocumentMatrix(corpus)
findAssocs(tdm, c("werkbon"), c(0.1))
$werkbon
          beide      facturatie           gezet      handmatige      aangemaakt
           0.24            0.23            0.22            0.21            0.20
            ter         dubbele           tekst          refesh            data
           0.19            0.18            0.17            0.16            0.16
     geschakeld             jdg             tav     terugsturen         wrktijd
           0.16            0.16            0.16            0.16            0.16
          ansul      commentaar        compleet           deels      verdwijnen
           0.16            0.16            0.16            0.16            0.16
       volgorde     voorspellen      geschreven         vermeld     afgehandeld
           0.16            0.16            0.15            0.15            0.15
     onmogelijk correspondentie      aangezeten     onderbroken       omgeklokt
           0.15            0.15            0.15            0.15            0.14
          terug           staat           staan       aangepast          gereed
           0.13            0.13            0.13            0.13            0.12
       dezelfde      opdrachten       wekelijks   verschillende        foutieve
           0.12            0.12            0.11            0.11            0.11
   binnendienst           klokt     gegenereerd        genoemde         omtrent
           0.11            0.11            0.11            0.11            0.11
       werktijd          appjes        reageren    telefoontjes          zomaar
           0.10            0.10            0.10            0.10            0.10
      amsterdam            camp         complex           haven             new
           0.10            0.10            0.10            0.10            0.10
          datum            erop            reis         omgezet         terecht
           0.10            0.10            0.10            0.10            0.10
         vorige            bord          syntho        planners       nieuwbouw
           0.10            0.10            0.10            0.10            0.10
     neerzetten          straks        streepje       bevestigd         overeen
           0.10            0.10            0.10            0.10            0.10
  eerstvolgende         gelezen          teller     belangrijke       dringende
           0.10            0.10            0.10            0.10            0.10
         drukke        ergernis        hersteld           lipje          mening
           0.10            0.10            0.10            0.10            0.10
    microswitch   standbewaking           stuur     verschaffen      verbruikte
           0.10            0.10            0.10            0.10            0.10
          hangt      ingeleverd contractpersoon      frustratie        mintuten
           0.10            0.10            0.10            0.10            0.10
      artikelnr
           0.10
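
For the association network mentioned at the start of the thread, output like this can be turned into an edge list, which is the usual input for graph tools. The sketch below types in the first five associations from the output above; the 0.21 cut-off and the column names are arbitrary choices of mine.

```r
# First five associations for 'werkbon', copied from the findAssocs output.
assoc <- c(beide = 0.24, facturatie = 0.23, gezet = 0.22,
           handmatige = 0.21, aangemaakt = 0.20)

# One row per edge: from the search word to each associated word.
edges <- data.frame(
  from = "werkbon",
  to = names(assoc),
  weight = unname(assoc),
  stringsAsFactors = FALSE
)

# Keep only the stronger links (threshold chosen for illustration).
edges <- edges[edges$weight >= 0.21, ]
edges
```

A graph package could then draw this; for instance, igraph's graph_from_data_frame() accepts a data frame of this shape. Repeating the step for each of the top 5 words would give one network per word.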

technocrat · January 25, 2020, 6:43pm

Great. Please see the FAQ "How do I mark a solution?" and mark the solution for the benefit of those to follow. (No false modesty!)

system · February 15, 2020, 6:43pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.