I recently started reading about sentiment analysis using R and tried to implement it on sample data consisting of 4 columns: classifier, date, text, and type. While cleaning the data, using the tm_map function to convert all the text to lowercase, I encountered the error "Error in FUN(content(x), ...) : invalid multibyte string 1777", for which I couldn't find a solution. If anyone has run into the same issue and knows a workaround, please let me know.
In order for us to help you with your question, please provide a minimal reproducible example (reprex): a minimal (dummy) dataset and code that recreate the issue. Once we have that, we can go from there. For help on creating a reprex, see this guide:
Hello PJ,
Thanks for your mail. Please find the code below for your review:
tweetData = read.csv("tweets.csv",stringsAsFactors = FALSE)
train = tweetData[tweetData$type=="train",-c(4)]
test = tweetData[tweetData$type=="test",-c(4)]
head(train,n=5)
classifier date
1 1 Mon Apr 06 22:45:40 PDT 2009
2 1 Mon Apr 06 23:01:15 PDT 2009
3 1 Mon Apr 06 23:21:30 PDT 2009
4 1 Tue Apr 07 01:03:56 PDT 2009
5 1 Tue Apr 07 03:16:35 PDT 2009
text
1 Bad news was Dad has cancer and is dying Good news new business started and I am now a life coach practising holistic weight management
2 im lonely keep me company! 22 female, new york
3 Sad about Kutner being killed off my fav show House!
4 is going to priceline (city) tomorrow, but lost her 'must haves' list
5 Difficulties with GTalk Closing the Division for the day. Later, everyone
library(tm)
tweets.corpus = Corpus(VectorSource(train$text))
summary(tweets.corpus)
inspect(tweets.corpus[1:5])
#Data Cleaning
tweets.corpus = tm_map(tweets.corpus,tolower)
tweets.corpus = tm_map(tweets.corpus,stripWhitespace)
tweets.corpus = tm_map(tweets.corpus,removePunctuation)
tweets.corpus = tm_map(tweets.corpus,removeNumbers)
my_stopwords = c(stopwords("english"),'available')
tweets.corpus = tm_map(tweets.corpus,removeWords,my_stopwords)
When doing the data cleaning, I get the following error: Error in FUN(content(x), ...) : invalid multibyte string 1777
Please let me know if you know of a solution for this.
It seems your problem is not in your code, but in your input. "invalid multibyte string" likely means the text contains bytes that are not valid in the character encoding R assumes for it, and 1777 is probably the position of the offending element.
Find out what encoding the file has (this is often an issue when files were generated on, for example, a Mac and then used on Windows, or vice versa) and then specify that in R like so:
data = read.csv("data.csv", encoding="UTF-8")
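If the file's encoding is not known, one way to narrow it down is to check which strings are not valid UTF-8 and then try re-encoding them. The following is only a sketch on a made-up vector: the name texts, the sample byte 0xE9, and the guess that the data is Latin-1 are all assumptions for illustration, not facts about tweets.csv.

```r
# Hypothetical sample data: the second element ends with byte 0xE9
# ("e" with an acute accent in Latin-1), which is not valid UTF-8.
texts <- c("plain ascii text",
           rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9))))

# validUTF8() (base R >= 3.3.0) is FALSE for elements that are not valid UTF-8,
# so this gives the positions of the problematic strings.
bad <- which(!validUTF8(texts))
print(bad)

# Assumption: the bytes are really Latin-1. Re-encode them to UTF-8.
fixed <- iconv(texts, from = "latin1", to = "UTF-8")

# Every element should now be valid UTF-8.
print(all(validUTF8(fixed)))
```

On the real data, the same check could be run on train$text; if the error number is an element position, it should show up among the reported indices.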
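Another option is to remove all special characters before cleaning, by using something like iconv() (toString() on its own only collapses a vector into a single string and will not remove them).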
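A minimal sketch of the "remove special characters" route, assuming it is acceptable to drop every byte that cannot be represented in ASCII (accented letters, emoji, and so on are lost). The vector texts is made up for illustration; on the real data the same call would be applied to train$text before building the corpus.

```r
# Hypothetical sample data: the second element contains byte 0xE9,
# which is not a valid UTF-8 sequence on its own.
texts <- c("Plain ASCII Text",
           rawToChar(as.raw(c(0x43, 0x61, 0x66, 0xe9,
                              0x20, 0x41, 0x75, 0x20,
                              0x4c, 0x61, 0x69, 0x74))))

# Convert to ASCII; sub = "" replaces every non-convertible byte
# with an empty string, i.e. it drops them.
clean <- iconv(texts, from = "UTF-8", to = "ASCII", sub = "")

# tolower() now only sees plain ASCII input.
lowered <- tolower(clean)
print(lowered)
```

This is lossy by design; if the non-ASCII characters matter for the analysis, specifying the correct encoding when reading the file is the better option.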