I am using the new conText package in R to run an embedding regression model. This model lets me assess whether the context in which a focal word appears (the words before and after it) varies as a function of covariates. Below is the code I have written so far:
# load packages
library(quanteda)
library(ldatuning)
library(topicmodels)
library(tidytext)
library(tidyverse)
library(parallel)
library(conText)
library(data.table)
library(text2vec)
# load speeches
speeches <- read_csv("speeches_final.csv")
# create corpus
# preparing speeches
speeches$text <- as.character(speeches$text)
speeches$docnames <- seq.int(nrow(speeches))
speeches_corpus <- quanteda::corpus(speeches, text_field = "text")
# tokenize corpus removing unnecessary (i.e. semantically uninformative) elements
toks <- tokens(speeches_corpus, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE, remove_separators = TRUE)
# clean out stopwords and words with 2 or fewer characters
toks_nostop <- tokens_select(toks, pattern = stopwords("ru", source = "snowball"),
                             selection = "remove", min_nchar = 3)
# only use features that appear at least 5 times in the corpus
feats <- dfm(toks_nostop, tolower = TRUE, verbose = TRUE) %>%
  dfm_trim(min_termfreq = 5) %>%
  featnames()
# leave the pads so that non-adjacent words will not become adjacent
toks <- tokens_select(toks_nostop, feats, padding = TRUE)
# build a tokenized corpus of contexts surrounding the target term 'economy'
economy_toks <- tokens_context(x = toks, pattern = "экономи*", window = 6L)
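For reference, here is a minimal toy example (made-up sentences, not my real corpus) of what tokens_context() returns: it keeps only the documents that match the pattern and carries their docvars along with the extracted contexts:

```r
library(quanteda)
library(conText)

# toy corpus (not my real data) with a hypothetical party_ur docvar
toy <- corpus(c(d1 = "экономика растет быстро",
                d2 = "новая экономика страны",
                d3 = "погода сегодня хорошая"),
              docvars = data.frame(party_ur = c(1, 1, 0)))

# extract contexts around the same pattern
toy_ctx <- tokens_context(x = tokens(toy), pattern = "экономи*", window = 2L)

ndoc(toy_ctx)                 # 2 -- only d1 and d2 contain the pattern
docvars(toy_ctx, "party_ur")  # both remaining contexts have party_ur == 1
```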
# build document-feature matrix
economy_dfm <- dfm(economy_toks)
economy_dfm[1:3, 1:3]
# construct the feature co-occurrence matrix for our toks object (see above)
toks_fcm <- fcm(toks, context = "window", window = 6, count = "frequency", tri = FALSE)
# estimate glove model using text2vec
glove <- GlobalVectors$new(rank = 300, x_max = 10, learning_rate = 0.05)
wv_main <- glove$fit_transform(toks_fcm, n_iter = 10, convergence_tol = 0.001,
                               n_threads = parallel::detectCores()) # use all available cores
wv_context <- glove$components
local_glove <- wv_main + t(wv_context) # word vectors
local_transform <- compute_transform(x = toks_fcm, pre_trained = local_glove, weighting = "log")
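As a sanity check on this step (my own addition, on a toy corpus with made-up numbers), I verified that the feature names of the fcm and of the summed GloVe output line up, since, as I understand it, compute_transform() matches the fcm and the pre-trained embeddings by feature name:

```r
library(quanteda)
library(text2vec)

# toy data (not my real corpus), just to check that the fcm and the GloVe
# output share the same feature names before calling compute_transform()
toy_toks <- tokens(c("a b c a b d", "b c d b c a", "a d c a d b"))
toy_fcm <- fcm(toy_toks, context = "window", window = 2, count = "frequency", tri = FALSE)

toy_glove <- GlobalVectors$new(rank = 4, x_max = 10, learning_rate = 0.05)
toy_main <- toy_glove$fit_transform(toy_fcm, n_iter = 3, convergence_tol = -1)
toy_vecs <- toy_main + t(toy_glove$components)  # word vectors, as above

setequal(rownames(toy_vecs), rownames(toy_fcm))  # features line up
```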
All of the above code executes without issue. The problem occurs when I try to run the next chunk, the actual conText model. My focal word is экономи* (Russian for "economy"), and my covariates are dummy variables for date and party affiliation.
# run the context embedding regression model
set.seed(2021L)
model1 <- conText(formula = "экономи*" ~ Date_dummy + party_ur,
data = toks,
pre_trained = local_glove,
transform = TRUE, transform_matrix = local_transform,
bootstrap = TRUE, num_bootstraps = 10,
permute = TRUE, num_permutations = 100,
window = 6L, case_insensitive = TRUE,
verbose = TRUE)
When I run this code, I receive the following error message:

Error in solve.default(t(X_mat) %*% X_mat) : system is computationally singular: reciprocal condition number = 0

This suggests that the design matrix is not invertible. I have checked that my variables are not collinear, and I have tried debugging the code to no avail; I am truly lost as to what is going on here. Note that when I run the above model with Date_dummy as the only covariate, I do get results, which leads me to believe that something is going on with the party_ur variable. I am happy to provide my full code and data if that would help. Any feedback would be greatly appreciated.
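To illustrate what I think the error means (toy numbers, not my data): if a covariate had no variation across the observations that actually enter the regression, the cross-product matrix would be rank-deficient and solve() would fail with a singularity error:

```r
# toy design matrix (made-up numbers): party_ur is constant across all rows,
# so it is collinear with the intercept and t(X) %*% X is rank-deficient
X_mat <- cbind(intercept  = 1,
               Date_dummy = c(0, 1, 0, 1, 1),
               party_ur   = c(1, 1, 1, 1, 1))  # no variation

qr(t(X_mat) %*% X_mat)$rank  # 2 -- less than the 3 columns

# solve() on the rank-deficient cross-product throws a singularity error
inherits(try(solve(t(X_mat) %*% X_mat), silent = TRUE), "try-error")  # TRUE
```

I am not sure this is what is happening in my case, since my checks suggested party_ur does vary in the full data, but it matches the shape of the error.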