Dear all,
I have been using the rtweet package in conjunction with seededlda and quanteda, in order to analyse the tweets of Emmanuel Macron (President of France) and Marine Le Pen (President of National Rally and his most prominent opponent).
I have not encountered any problems up until now regarding the utilisation of topic modelling, however a single issue has perplexed me. I wish to isolate the top five topics within the discourse of each politician in order to place them into graphs. I have already done something similar with their most frequently utilised words, just like this example: 6 Topic modeling | Text Mining with R
However, I was wondering if the same principle can be applied for visualisations such as a bar plot or a stacked area chart regarding the topics themselves (Topic 9, 13, 5 etc.). In the case of the latter, I had observed that isolating certain topics via the filter command left gaps in the graph. I understand that this is due to the fact that the topics are correlated within the corpus, and by removing some of them will always create the problem.
I would be really grateful if you could provide me with some additional resources that might assist me in examining the discourse of the two via topic modelling, or any suggestion for further analyses or visualisations. I have already utilised the LDAvis package and it is excellent, but I was wondering if it is possible to make things a bit tighter in terms of topic selection. I have included the code below along with the packages that I utilised. Please notify me if I need to change anything in the post or if you need a reprex.
library(rtweet)
library(tidyverse); theme_set(theme_minimal())
library(quanteda)
library(stm)
library(seededlda)
library(stopwords)
library(stringi)
library(lubridate)
library(topicmodels)
LePen2020b<-get_timeline("@MLP_officiel",n=3200)
LePen_terrorism_2<-as.data.frame(LePen2020b)
saveRDS(LePen_terrorism_2,"LePen_terrorism_2.rds")
LePen_Patyb<-readRDS("LePen_terrorism_2.rds")
LePen_corpus_2020b<-LePen_Patyb%>%
mutate(text_full = case_when(is_retweet ~ retweet_text, is_quote ~ paste(text, "|", quoted_text), TRUE ~ text)) %>%filter(created_at>"2020-09-01",created_at<"2020-12-01") %>% corpus(docid_field = "status_id",
text_field = "text_full")
LePen_dfm_Paty<-LePen_corpus_2020b%>%
tokens(remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE,
remove_separators = TRUE) %>%tokens_remove(french_stopwords) %>%
tokens_ngrams(n = 1:3) %>%dfm(tolower = TRUE,
stem = FALSE) %>%
dfm_trim(min_docfreq = 0.005,max_docfreq = 0.99, docfreq_type = "prop")
LePen_lda<-textmodel_lda(LePen_dfm_Paty,k=25)
saveRDS(LePen_lda,"LePen2020lda.rds")
theta_long_LePenb <- LePen_lda$theta %>%
as.data.frame() %>%
rownames_to_column("doc_id") %>%
pivot_longer(cols = starts_with("topic"),
names_to = "topic",
values_to = "theta") %>%
mutate(topic = fct_inorder(topic))
theta_long_LePenb %>%
group_by(topic) %>%
summarise(theta = mean(theta), .groups = "drop") %>%
mutate(topic = fct_reorder(topic, theta)) %>%
ggplot(aes(x = theta, y = topic)) +
geom_col() +
scale_x_continuous(limits = c(0, 0.15))
LePen_weighted <- LePen_lda$phi %>%
as.data.frame() %>%
rownames_to_column("topic") %>%
pivot_longer(-topic, names_to = "word", values_to = "phi") %>%
tidylo::bind_log_odds(set = topic, feature = word, n = phi) %>% # <-- insertion
arrange(-log_odds_weighted) %>%
group_by(topic) %>%
slice_max(order_by = log_odds_weighted, n = 7) %>%
summarise(label = toString(word))
theta_long_LePenb %>%
group_by(topic) %>%
summarise(theta = mean(theta), .groups = "drop") %>%
left_join(LePen_weighted, by = "topic") %>%
mutate(topic = fct_reorder(topic, theta)) %>% ggplot(aes(x = theta, y = topic, fill = topic, label = label)) +
geom_col(show.legend = FALSE) +
geom_text(hjust = 0, nudge_x = 0.005, size = 3) +
scale_x_continuous(limits = c(0, 0.35))
LePen_meta<-theta_long_LePenb %>% left_join(LePen_Patyb, by = c("doc_id" = "status_id")) %>%
mutate(day_created = as.Date(created_at)) %>% filter(created_at>"2020-09-01",created_at<"2020-12-01") %>%
group_by(topic, day_created) %>%
summarise(theta = mean(theta), .groups = "drop") %>%
group_by(day_created)
theta_long_meta_LePen %>%
mutate(sum = sum(theta))
LePen_meta %>%filter(topic=="topic2"|topic=="topic3"|topic=="topic9") %>% mutate(month_created = lubridate::floor_date(day_created, "month")) %>%
group_by(topic, month_created) %>%
summarise(theta = mean(theta), .groups = "drop") %>%
ggplot(aes(x = month_created, y = theta, color = topic))+geom_line(size=1.3)+labs(x=NULL,title = "Le Pen Frames on Terrorism")+
scale_x_date(expand = c(0, 0), date_breaks = "1 month", date_labels = "%B %Y") +
theme(axis.text.x = element_text(angle = 45, hjust=1),
legend.position = "bottom")
This is the entirety of the code for Le Pen. I did not wish to burden you with both of them, but I also wanted to be as thorough as possible. I would greatly appreciate your suggestions and help.
Best regards,
MiltR