I am doing a sentiment analysis project for my PhD research and have been getting the following error:
Error in .tolower(txt) : invalid input 'ââ€' in 'utf8towcs'
This happens after I clean the text in my corpus and then try to create a DocumentTermMatrix. From some initial research, it appears to be caused by non-ASCII characters in the Twitter text, such as emojis. Can someone please tell me how to solve this problem? Thanks.
Here is the R code that I was using:
setwd('C:/rscripts/tweet_sentiment')
dataset <- read.csv('hillary_tweets.csv')
library(readr)
library(tm)
library(ggplot2)
library(wordcloud)
library(plyr)
library(lubridate)
library(SnowballC)

# Pull out the tweet text and shuffle it
text <- as.character(dataset$text)
sample <- sample(text, length(text))

# Note: wrapping `sample` in list() makes VectorSource treat the whole
# vector as a single document
corpus <- Corpus(VectorSource(list(sample)))

# Standard tm cleaning pipeline
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)

# The utf8towcs error is thrown at this step
dtm_up <- DocumentTermMatrix(VCorpus(VectorSource(corpus[[1]]$content)))
Could you ask this with a minimal REPRoducible EXample (reprex)? A reprex makes it much easier for others to understand your issue and figure out how to help.
In this case, I'd include a snippet of your dataset object that contains the non-ASCII characters, so others can replicate your error. That way you can skip setwd('C:/rscripts/tweet_sentiment') and dataset = read.csv('hillary_tweets.csv').
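For example (just a sketch; dput() is base R, and dataset$text comes from your code above), you could paste the output of:

# share a small, self-contained sample of the raw text column
dput(head(as.character(dataset$text), 10))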
I'm having a hard time replicating your error, but as a quick suggestion, you might check out the R package rtweet. It has a plain_tweets function that takes your tweets and returns a value "reformatted with ascii encoding and normal ampersands and without URL links, line breaks, fancy spaces/tabs, fancy apostrophes."
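Roughly like this (a sketch; it assumes your tweets are already in a character vector and that your rtweet version exports plain_tweets()):

library(rtweet)
raw_text <- as.character(dataset$text)
clean_text <- plain_tweets(raw_text)  # ASCII-encoded, no URLs or line breaks
head(clean_text)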
There are also tools for dealing with non-ASCII characters in R rather than removing them; StackOverflow has nice discussions on this. A reprex might be useful along these lines too.
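For instance (a base-R sketch, not specific to your data), iconv() can either transliterate or drop non-ASCII characters:

x <- "caf\u00e9 \u2018fancy\u2019 quotes"
iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")  # approximate with ASCII (platform-dependent)
iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop non-ASCII entirely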
Thank you for your reply. I was able to get the document-term matrix created successfully. What I need to do now is feed the DTM into an XGBoost machine learning classification model, and I am having some issues getting the DTM to work in the classifier. I will post another issue detailing this.
Awesome!
Would it be easy and useful to others to share your solution?
I will go ahead and post what I came up with to solve my issue. This is only half of what I need to accomplish; I have recently posted another thread asking for help with my other issue, which is feeding my document-term matrix into an XGBoost classifier. Here is the code I used to import and clean my Twitter dataset:
setwd('C:/rscripts/random_forest')
dataset <- read.csv('tweets_all.csv', stringsAsFactors = FALSE)
library(tm)

# Convert the raw text to UTF-8 before building the corpus
corpus <- iconv(dataset$text, to = "UTF-8")
corpus <- Corpus(VectorSource(corpus))
inspect(corpus[1:5])

# Cleaning pipeline
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
cleanset <- tm_map(corpus, removeWords, stopwords('english'))

# Strip URLs
removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)
cleanset <- tm_map(cleanset, content_transformer(removeURL))
cleanset <- tm_map(cleanset, stripWhitespace)

# Remove leftover mojibake tokens and a few frequent words
cleanset <- tm_map(cleanset, removeWords, c('Ã\u009dhillary','ââ¬Å¾Ã','ââ¬Å¡Ã','just','are','all','they'))

tdm <- TermDocumentMatrix(cleanset)
tdm
tdm <- as.matrix(tdm)
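As a quick sanity check (not part of the original fix; rowSums() and sort() are base R), you can list the most frequent terms in the matrix:

freq <- rowSums(tdm)  # term frequencies across all documents
head(sort(freq, decreasing = TRUE), 10)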