Examining data cluster contents.

Hi, I am very new to R and trying to figure out how to work with it.

I have combined 2 data frames (tweets.public and tweets.company) to generate a document-term matrix, which I later used for data clustering.

After getting all the clusters, I am trying to calculate the proportion of tweets.public in each cluster. How can I do that?

Please see my code below.
tweets.company holds tweets posted by a company (say, Apple) and tweets.public holds tweets about that company; both were downloaded from Twitter.

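library(tm)  # Corpus, tm_map, DocumentTermMatrix, weightTfIdf

# Public tweets come first, followed by company tweets
tweets = c(tweets.public$text, tweets.company$text)

tweets.corpus = Corpus(VectorSource(tweets))
# Plain functions such as iconv and tolower are wrapped in content_transformer()
# so that tm_map keeps the corpus structure; sub = '' drops characters that
# cannot be converted instead of turning the whole tweet into NA
tweets.corpus = tm_map(tweets.corpus, content_transformer(function(x) iconv(x, to = 'ASCII', sub = '')))
tweets.corpus = tm_map(tweets.corpus, removeNumbers)
tweets.corpus = tm_map(tweets.corpus, removePunctuation)
tweets.corpus = tm_map(tweets.corpus, stripWhitespace)
tweets.corpus = tm_map(tweets.corpus, content_transformer(tolower))
tweets.corpus = tm_map(tweets.corpus, removeWords, stopwords('english'))
tweets.corpus = tm_map(tweets.corpus, stemDocument)  # needs the SnowballC package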

tweets.dtm = DocumentTermMatrix(tweets.corpus)
tweets.wdtm = weightTfIdf(tweets.dtm)
tweets.matrix = as.matrix(tweets.wdtm)

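# Drop tweets that have no terms left after cleaning.
# A logical mask is safe even when no row is empty (x[-integer(0), ] would drop every row).
keep = rowSums(abs(tweets.matrix)) > 0
tweets.matrix = tweets.matrix[keep, , drop = FALSE]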

After this I ran k-means and plotted the clusters.

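K = kmeans(tweets.matrix, 3, nstart = 20)
mds.tweets <- cmdscale(dist(tweets.matrix), k = 2)
# Note: pch = 1:2 is recycled point by point, so the symbols alternate by row
# and do not actually encode which data frame a tweet came from.
plot(mds.tweets, col = K$cluster, pch = 1:2)
legend("bottomright", c("tweets.public", "tweets.company"), pch = c(1, 2))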

I know how to extract the tweets in each cluster:

cluster1_tweetsIndex = which(K$cluster == 1)
cluster1Tweets = tweets.matrix[cluster1_tweetsIndex,]

What I am trying to figure out is how many of the tweets in cluster 1 come from tweets.public.

Hope that explains it.
Thanks in advance.

Hi, and welcome. To help you better, a reproducible example (called a reprex) that illustrates the problem with a sample of your term-document matrix would go a long way. Without knowing the structure, it's hard to give specific guidance.
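In the meantime, here is a minimal sketch with made-up data of one common way to approach this kind of question: keep a label vector that records which data frame each tweet came from, filter it together with the matrix, and cross-tabulate it against the cluster assignments. It uses only base R. The names tweets.source, toy.matrix, n.public and n.company are placeholders I made up, and the random matrix is only a stand-in for your weighted matrix, so treat this as an illustration rather than a drop-in solution.

```r
# Toy stand-ins for the real data (all values here are invented)
set.seed(42)
n.public  <- 6   # hypothetical stand-in for nrow(tweets.public)
n.company <- 4   # hypothetical stand-in for nrow(tweets.company)

# One label per tweet, in the same order as c(tweets.public$text, tweets.company$text)
tweets.source <- rep(c("public", "company"), times = c(n.public, n.company))

# Fake weighted document-term matrix: 10 tweets x 5 terms
toy.matrix <- matrix(runif(10 * 5), nrow = 10)
toy.matrix[3, ] <- 0   # pretend tweet 3 lost all of its terms during cleaning

# Filter the matrix and the labels with the same logical mask so they stay aligned
keep <- rowSums(abs(toy.matrix)) > 0
toy.matrix    <- toy.matrix[keep, , drop = FALSE]
tweets.source <- tweets.source[keep]

K <- kmeans(toy.matrix, centers = 3, nstart = 20)

# Rows = cluster, columns = source
counts <- table(cluster = K$cluster, source = tweets.source)

# Share of each source within each cluster (every row sums to 1)
shares <- prop.table(counts, margin = 1)

# Number and proportion of public tweets in cluster 1
counts["1", "public"]
shares["1", "public"]
```

If something like this matches your setup, the same label vector could also drive the plot symbols, for example pch = ifelse(tweets.source == "public", 1, 2), so that the legend reflects where each point came from.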
