Error in FUN(content(x), ...) : invalid multibyte string 1777

lawrence2269 · August 3, 2019, 11:25pm

I recently started reading about sentiment analysis in R and tried to implement it on sample data with 4 columns: classifier, date, text, and type. During data cleaning, when I use the tm_map function to convert all the text to lowercase, I get the error "Error in FUN(content(x), ...) : invalid multibyte string 1777", for which I couldn't find a solution. If anyone has run into the same issue and knows a workaround, please let me know.

Thanks in advance.

pieterjanvc · August 4, 2019, 2:37pm

Hi,

In order for us to help you with your question, please provide a minimal reproducible example: a minimal (dummy) dataset and code that can recreate the issue. Once we have that, we can go from there. For help on creating a reprex, see this guide:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs


Good luck!
PJ

lawrence2269 · August 5, 2019, 7:21am

Hello PJ,
Thanks for your reply. Please find the code below for your review:

tweetData = read.csv("tweets.csv",stringsAsFactors = FALSE)
train = tweetData[tweetData$type=="train",-c(4)]
test = tweetData[tweetData$type=="test",-c(4)]
head(train,n=5)

classifier                         date
1          1 Mon Apr 06 22:45:40 PDT 2009
2          1 Mon Apr 06 23:01:15 PDT 2009
3          1 Mon Apr 06 23:21:30 PDT 2009
4          1 Tue Apr 07 01:03:56 PDT 2009
5          1 Tue Apr 07 03:16:35 PDT 2009
                                                                                                                                        text
1 Bad news was Dad has cancer and is dying   Good news new business started and  I am now a life coach practising holistic weight management
2                                                                                            im lonely  keep me company! 22 female, new york
3                                                                                      Sad about Kutner being killed off my fav show House! 
4                                                                     is going to priceline (city) tomorrow, but lost her 'must haves' list 
5                                                                Difficulties with GTalk  Closing the Division for the day. Later, everyone

library(tm)
tweets.corpus = Corpus(VectorSource(train$text))
summary(tweets.corpus)
inspect(tweets.corpus[1:5])

#Data Cleaning
tweets.corpus = tm_map(tweets.corpus,tolower)
tweets.corpus = tm_map(tweets.corpus,stripWhitespace)
tweets.corpus = tm_map(tweets.corpus,removePunctuation)
tweets.corpus = tm_map(tweets.corpus,removeNumbers)
my_stopwords = c(stopwords("english"),'available')
tweets.corpus = tm_map(tweets.corpus,removeWords,my_stopwords)

When doing the data cleaning, I get the following error: Error in FUN(content(x), ...) : invalid multibyte string 1777

Please let me know if you know of any solution for this.

pieterjanvc · August 7, 2019, 12:44pm

Hi,

It seems your problem is not in your code, but in your input. "invalid multibyte string" likely refers to bytes in the text that are not valid in the character encoding R assumes for it.

Find out what encoding the file has (this is often an issue when files were generated on, for example, a Mac and then used on Windows, or vice versa) and then specify that in R like so:
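
To check which entries are affected, here is a minimal sketch in base R. The vector `texts` is made up for illustration; with the real data it would be train$text. The number at the end of the error message appears to be the position of the offending element, so element 1777 would be a good place to start looking.

```r
# Made-up sample: the second string contains byte 0xE9 ("é" in Latin-1),
# which is not a valid UTF-8 sequence on its own.
bad <- rawToChar(c(charToRaw("caf"), as.raw(0xe9), charToRaw(" au lait")))
texts <- c("plain ascii text", bad)

# validUTF8() returns FALSE for elements whose bytes are not valid UTF-8.
ok <- validUTF8(texts)

# Positions of the problematic elements.
bad_idx <- which(!ok)
bad_idx
# 2
```

If this flags rows in the real data, the file is most likely not UTF-8 (or is a mix of encodings).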

data = read.csv("data.csv", encoding="UTF-8")
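
As a rough illustration of how the encoding argument plays out, the sketch below assumes the file is actually Latin-1 (an assumption; the real tweets.csv may be something else). It writes a tiny stand-in file to a temporary path, reads it with that encoding declared, and converts the text to UTF-8 before any cleaning.

```r
# Create a small Latin-1 encoded CSV as a stand-in for the real file.
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("text\ncaf"), as.raw(0xe9), charToRaw(" au lait\n")), path)

# 'encoding' marks the strings that are read as being in the given encoding.
dat <- read.csv(path, encoding = "latin1", stringsAsFactors = FALSE)

# Convert the marked strings to UTF-8 so later steps see valid input.
dat$text <- enc2utf8(dat$text)

# Every element should now be valid UTF-8.
all_valid <- all(validUTF8(dat$text))
all_valid
```

read.csv also has a fileEncoding argument, which re-encodes the file while reading it; which of the two is more convenient depends on the platform and the file.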

Another option is to remove all special characters by using something like iconv() with its sub argument.
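
One way to strip such characters in base R is iconv() with sub = "", sketched below on a made-up vector (`texts` is not from the original data). Note that this is lossy: converting to ASCII drops every non-ASCII character, not just the invalid bytes.

```r
# Made-up sample: the second string contains an invalid byte (0xE9).
bad <- rawToChar(c(charToRaw("caf"), as.raw(0xe9), charToRaw(" au lait")))
texts <- c("plain ascii text", bad)

# Bytes that cannot be converted are replaced by 'sub' (here: dropped).
clean <- iconv(texts, from = "UTF-8", to = "ASCII", sub = "")
clean

# The cleaned text can now be lowercased without the multibyte error.
lower <- tolower(clean)
lower
```

With the data from this thread, the same call could be applied to train$text before building the corpus.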

Hope this helps,
PJ
