Hi, I am very new to R and trying to figure out how to work with it.
I have combined 2 data frames (tweets.public and tweets.company) to generate a term-document matrix, which I later used for data clustering.
After getting all the clusters, I am trying to calculate the proportion of tweets.public in each cluster. How can I do that?
EDIT:
Please see my code below.
tweets.company holds tweets posted by a company (say Apple) and tweets.public holds tweets about that company; both were downloaded from Twitter.
tweets = c(tweets.public$text, tweets.company$text)
tweets.corpus = Corpus(VectorSource(tweets))
tweets.corpus = tm_map(tweets.corpus, function(x) iconv(x, to='ASCII'))
tweets.corpus = tm_map(tweets.corpus, removeNumbers)
tweets.corpus = tm_map(tweets.corpus, removePunctuation)
tweets.corpus = tm_map(tweets.corpus, stripWhitespace)
tweets.corpus = tm_map(tweets.corpus, tolower)
tweets.corpus = tm_map(tweets.corpus, removeWords, stopwords('english'))
tweets.corpus = tm_map(tweets.corpus, stemDocument)
tweets.dtm = DocumentTermMatrix(tweets.corpus)
tweets.wdtm = weightTfIdf(tweets.dtm)
tweets.matrix = as.matrix(tweets.wdtm)
keep = rowSums(abs(tweets.matrix)) > 0
tweets.matrix = tweets.matrix[keep,]
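Dropping rows changes which row belongs to which tweet, so any per-tweet source label has to be filtered with the same index. A minimal sketch on toy data follows. `toy.matrix` and `toy.source` are hypothetical stand-ins for `tweets.matrix` and for a label vector built in the same order as `c(tweets.public$text, tweets.company$text)`. It uses a logical index because a negative index built from an empty `which()` result selects zero rows instead of all of them.

```r
# Toy stand-in for the tf-idf matrix: 5 documents x 3 terms (hypothetical data).
toy.matrix <- matrix(c(0.2, 0,   0,
                       0,   0,   0,
                       0,   0.5, 0.1,
                       0.3, 0,   0,
                       0,   0,   0.4),
                     nrow = 5, byrow = TRUE)

# One source label per document, in the same order the texts were combined:
# here 3 public tweets followed by 2 company tweets.
toy.source <- c(rep("public", 3), rep("company", 2))

# Logical index of the non-empty rows. Unlike x[-which(...), ], this still
# keeps every row when no row is empty.
keep <- rowSums(abs(toy.matrix)) > 0

# Apply the same filter to the matrix and to the labels so they stay aligned.
toy.matrix <- toy.matrix[keep, , drop = FALSE]
toy.source <- toy.source[keep]
```

With the real data, the label vector would presumably be built as `c(rep("public", nrow(tweets.public)), rep("company", nrow(tweets.company)))` before any rows are removed.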
After this I ran k-means and plotted the clusters.
K = kmeans(tweets.matrix, 3 , nstart = 20)
mds.tweets <- cmdscale(dist(tweets.matrix),k=2)
plot(mds.tweets, col = K$cluster,pch = 1:2)
legend("bottomright", c("tweets.public","tweets.company"), pch = c(1,2))
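Note that `pch = 1:2` is recycled along the rows, so the symbols alternate point by point rather than following the source of each tweet. A minimal sketch of mapping the symbol to the source instead, on toy data: `toy.coords`, `toy.cluster` and `toy.source` are hypothetical stand-ins for `mds.tweets`, `K$cluster` and a per-tweet source label vector kept in the same row order as `tweets.matrix`.

```r
# Toy 2-D coordinates, cluster assignments and source labels (hypothetical data).
toy.coords  <- cbind(c(0.1, 0.4, -0.3, 0.2),
                     c(0.0, -0.2, 0.3, 0.1))
toy.cluster <- c(1, 2, 1, 3)
toy.source  <- c("public", "public", "company", "company")

# One plotting symbol per point, chosen by source: 1 = public, 2 = company.
pch.by.source <- ifelse(toy.source == "public", 1, 2)

# Colour encodes the cluster, symbol encodes the source.
plot(toy.coords, col = toy.cluster, pch = pch.by.source,
     xlab = "MDS 1", ylab = "MDS 2")
legend("bottomright", c("tweets.public", "tweets.company"), pch = c(1, 2))
```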
I know how to extract the tweets in each cluster:
cluster1_tweetsIndex = which(K$cluster == 1)
cluster1Tweets = tweets.matrix[cluster1_tweetsIndex,]
What I am trying to figure out is how many of the tweets in cluster 1 come from tweets.public, and what proportion of each cluster they make up.
Hope that explains it.
Thanks in advance.
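One possible approach to the count and proportion asked about above is to cross-tabulate the cluster assignment against a source label with `table()` and `prop.table()`. This is only a sketch on toy data. It assumes a label vector (`toy.source`, hypothetical) exists with one entry per row of the clustered matrix, in the same order, and `toy.cluster` stands in for `K$cluster`.

```r
# Toy cluster assignments and source labels for 8 tweets (hypothetical data).
toy.cluster <- c(1, 1, 2, 3, 1, 2, 3, 3)
toy.source  <- c(rep("public", 5), rep("company", 3))

# Cross-tabulate: rows are clusters, columns are sources.
counts <- table(cluster = toy.cluster, source = toy.source)

# Row-wise proportions: the share of each source within each cluster.
props <- prop.table(counts, margin = 1)

# Number of public tweets in cluster 1, and the public share of every cluster.
public.in.1  <- counts["1", "public"]
public.share <- props[, "public"]

print(counts)
print(round(props, 2))
```

Using `margin = 1` normalises within each cluster. `margin = 2` would instead give the share of each source that falls into each cluster.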