Text analysis and TF-IDF with multiwords

Hi, I am quite new to R and I am trying to run a text analysis and compute TF-IDF on a set of reports, restricted to a specific set of words from a dictionary I built. The code below produces those results, but it does not handle multi-word terms. For instance, it can count "technology" but not "data technology". Could you please help me fix the code so that multi-word terms are included in the analysis?

See the code I am using below:

# Load libraries
library(tidyverse)
library(tm)
library(tidytext)
library(readxl)

# Setting the folder where the documents are (set to subfolder 2012 for now to make it easier to handle)
wd <- "C:/Users/ple.si/Dropbox (CBS)/Manegerial Digital Attention (MAD)/New set 10K/2012"

# Create the corpus and clean it up a bit
corpus <- Corpus(DirSource(wd, recursive = TRUE)) # Create corpus
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, removeWords, stopwords("english")) # remove English stop words

# Create a document-term matrix
dtm <- DocumentTermMatrix(corpus)

# Use multiple steps to...
corpus_words <- tidy(dtm) %>% # ... transform the dtm to a tidy object
  bind_tf_idf(term, document, count) # ... use the bind_tf_idf function from tidytext to calculate tf, idf and tf-idf

total_words <- corpus_words %>% group_by(document) %>% summarize(total = sum(count)) # Calculate the number of words in each document
corpus_words <- left_join(corpus_words, total_words, by = "document") # add it to the table, joining on the document name

# Get the words of interest from the dictionary and rename the columns
dictionary <- read_xlsx("C:/Users/ple.si/Dropbox (CBS)/Manegerial Digital Attention (MAD)/New set 10K/DictionaryLIWCDigital_OnlyDigital_TG.xlsx", col_names = FALSE)
names(dictionary) <- c("term", "group")

# Take the individual term lists
inno_terms    <- dictionary$term[dictionary$group==1]
techno_terms  <- dictionary$term[dictionary$group==2]
data_terms    <- dictionary$term[dictionary$group==3]
digital_terms <- dictionary$term[dictionary$group==4]

# Filter the corpus for the words of interest
TF_IDF_Inno_terms2 <- corpus_words %>% filter(grepl(paste(inno_terms, collapse = "|"), term))
TF_IDF_techno_terms2 <- corpus_words %>% filter(grepl(paste(techno_terms, collapse = "|"), term))
TF_IDF_data_terms2 <- corpus_words %>% filter(grepl(paste(data_terms, collapse = "|"), term))
TF_IDF_digital_terms2 <- corpus_words %>% filter(grepl(paste(digital_terms, collapse = "|"), term))
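
A likely reason the multi-word entries never match: DocumentTermMatrix() splits the text into single words by default, so a term such as "data technology" never appears in the term column and the grepl() filter has nothing to match against. Below is a minimal sketch of one way around this, using tidytext's unnest_tokens() to build both single words and bigrams before counting. The two toy documents and the two dictionary entries are made up for illustration and are not taken from the real reports or dictionary. Note also that this sketch matches dictionary terms exactly with %in% rather than by regular expression, and that pooling unigrams and bigrams in one table changes the per-document totals that tf is computed from, so the numbers are not directly comparable to the single-word version.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the reports (hypothetical text, not the real corpus)
docs <- data.frame(
  document = c("a.txt", "b.txt"),
  text = c("Data technology drives growth. New technology matters.",
           "The firm invests in digital technology platforms."),
  stringsAsFactors = FALSE
)

# Hypothetical dictionary entries, one single word and one multi-word term
dict_terms <- c("technology", "data technology")

# Single words (unnest_tokens lowercases and strips punctuation by default)
unigrams <- docs %>% unnest_tokens(term, text)

# Two-word sequences, so that "data technology" exists as a term
bigrams <- docs %>% unnest_tokens(term, text, token = "ngrams", n = 2)

# Pool both, count per document, then compute tf-idf
term_counts <- bind_rows(unigrams, bigrams) %>%
  count(document, term, name = "count") %>%
  bind_tf_idf(term, document, count)

# Keep only the dictionary terms (exact match)
dict_hits <- term_counts %>% filter(term %in% dict_terms)
print(dict_hits)
```

For dictionary entries longer than two words, the same idea should extend by adding another unnest_tokens() call with a larger n. If stop words are removed before tokenizing, as in the code above, the bigrams are formed from the remaining words, which may or may not be what the dictionary expects.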
