Topic Modelling Preprocessing | Get rid of all special characters & symbols

Flocke · August 3, 2019, 11:28pm

Hi all,

today I've got a question related to preprocessing my dataset in order to do some topic modelling afterwards. Let's start with my current code:

library(plyr)
library(readr)
library(tidyverse)
library(tm)
library(wordcloud)
library(topicmodels)


# Create csv-Database
workingDir <- "/Users/flocke/Dropbox/Meins/FOM/Thesis/AmazonReviews/51558011"
myfiles = list.files(path=workingDir, 
                     pattern="*.txt", 
                     full.names=TRUE)

database = ldply(myfiles, 
                 read_csv, 
                 col_names = FALSE)

colnames(database) <- c("doc_id", 
                        "Date(YYYYMMDD)",
                        "ASIN",
                        "Stars",
                        "Helpful",
                        "OverallRating",
                        "Date(Text)",
                        "UserID",
                        "ReviewTitle",
                        "text")   # Review Text

# Create Corpus
amz_corpus <- Corpus(DataframeSource(database))

#Remove BS
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
amz_corpus <- tm_map(amz_corpus, toSpace, "\xc")
amz_corpus <- tm_map(amz_corpus, toSpace, "\\\\")
inspect(amz_corpus[62069])

amz_corpus <- tm_map(amz_corpus, toSpace, "[[:punct:]]")
amz_corpus <- tm_map(amz_corpus, toSpace, "@")
amz_corpus <- tm_map(amz_corpus, toSpace, "/")
amz_corpus <- tm_map(amz_corpus, toSpace, "&")
amz_corpus <- tm_map(amz_corpus, toSpace, "#")

#Cleaning up the text
amz_corpus <- tm_map(amz_corpus, stripWhitespace)
amz_corpus <- tm_map(amz_corpus, content_transformer(tolower))
amz_corpus <- tm_map(amz_corpus, removePunctuation)
amz_corpus <- tm_map(amz_corpus, removeNumbers)

#Remove Stopwords  **<- Point of failure**
amz_corpus <- tm_map(amz_corpus, removeWords, stopwords('english'))

#Text stemming
amz_corpus <- tm_map(amz_corpus, stemDocument)

#Build term-document-matrix
dtm <- TermDocumentMatrix(amz_corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 100)

I've get a bunch of errors like this:

Error in gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), :
input string 61984 is invalid UTF-8

Inspecting the element mentioned above gives me the following output:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 8
Content:  documents: 1

                                                                                                                                                                                                                                                                                                                                                                                                  987 
\xc3  am not a h\xc3 p hop fun .\xc3  usally l\xc3 sten to jazz or jazz rock or reggea. but th\xc3 s album made me love the h\xc3 p hop.the lyr\xc3 cs w\xc3 ch are very overwhelm\xc3 ng and \xc3 mpress\xc3 ve.\xc3 t makes me happy \xc3 t puts me \xc3 n a good mood.\xc3   l\xc3 sten to \xc3 t every day and \xc3  am sure \xc3  w\xc3 ll never get bored of \xc3 t.thanks  lauren thanks a lot

I've done some investigations and found out, that this might be related to all kind of weird special characters which I might get rid of first.

Example:

So my questions is: How can I get rid of them?

Many thanks in advance!

technocrat · August 4, 2019, 2:54am

The first step is to recognize these as Unicode characters that are not mapped to UTF8. You can read about the gory details. I say that because this won't be the first time you encounter Unicode related issues, and this is a good example.

The help(Encoding) command shows you how this works in practice with an example

x <- "fa\xE7ile"
Encoding(x)
Encoding(x) <- "latin1"
x

which yields façile. Emojis, Chinese or Japanese, Russian and a whole host of other characters present similar issues.

What I'd try first is textclean , which has a replace_non_ascii function that should get rid of most of these. Whether that creates a problem for your text analysis, I can't say, but you can take tokenized output, which should be a vector and test for the positions of any remaining non-ascii characters and write a short gsub function to replaced them with "". You may also want to use purrr's map function if you tokenized materials are in separate files.

system · August 25, 2019, 2:54am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.