Hello PJ,
Thanks for your mail. Please find the code below for your review:
tweetData = read.csv("tweets.csv",stringsAsFactors = FALSE)
train = tweetData[tweetData$type=="train",-c(4)]
test = tweetData[tweetData$type=="test",-c(4)]
head(train,n=5)
classifier date
1 1 Mon Apr 06 22:45:40 PDT 2009
2 1 Mon Apr 06 23:01:15 PDT 2009
3 1 Mon Apr 06 23:21:30 PDT 2009
4 1 Tue Apr 07 01:03:56 PDT 2009
5 1 Tue Apr 07 03:16:35 PDT 2009
text
1 Bad news was Dad has cancer and is dying Good news new business started and I am now a life coach practising holistic weight management
2 im lonely keep me company! 22 female, new york
3 Sad about Kutner being killed off my fav show House!
4 is going to priceline (city) tomorrow, but lost her 'must haves' list
5 Difficulties with GTalk Closing the Division for the day. Later, everyone
library(tm)
tweets.corpus = Corpus(VectorSource(train$text))
summary(tweets.corpus)
inspect(tweets.corpus[1:5])
#Data Cleaning
tweets.corpus = tm_map(tweets.corpus,tolower)
tweets.corpus = tm_map(tweets.corpus,stripWhitespace)
tweets.corpus = tm_map(tweets.corpus,removePunctuation)
tweets.corpus = tm_map(tweets.corpus,removeNumbers)
my_stopwords = c(stopwords("english"),'available')
tweets.corpus = tm_map(tweets.corpus,removeWords,my_stopwords)
When I run the data cleaning steps, I get the following error:
Error in FUN(content(x), ...) : invalid multibyte string 1777
Please let me know if you know of a solution for this.
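One possible workaround, sketched under the assumption that the error comes from bytes in the tweet text that are not valid in the session's encoding (tolower stops on such strings): clean the text with base R's iconv before building the corpus. The sample vector raw_text below is made up for illustration; in the code above the same call would be applied to train$text before Corpus(VectorSource(...)).

```r
# Hypothetical sample data: the second string contains a stray 0xE9 byte,
# which is not valid UTF-8 on its own.
raw_text <- c("Sad about Kutner", "caf\xe9 tomorrow")

# Re-encode as UTF-8 and drop any bytes that cannot be converted (sub = "").
# Assumption: losing those bytes is acceptable for this analysis.
clean_text <- iconv(raw_text, from = "UTF-8", to = "UTF-8", sub = "")

# Lowercasing the cleaned text no longer hits the invalid bytes.
lower_text <- tolower(clean_text)
print(lower_text)
```

If the real source encoding of tweets.csv is known (for example latin1), passing it as the from argument should preserve accented characters instead of dropping them. Separately, with tm 0.6 or later, base functions such as tolower generally need to be wrapped as tm_map(tweets.corpus, content_transformer(tolower)).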