Lot to learn about statistics! Personally in that realm I'm an ant in the Amazon forest and after 12 years of chewing haven't progressed much past my first fallen tree.
English is a world language. Even those of us who spoke it first and first learned to read and write in it use it wildly differently. The only standard for English is whether two people can make the needed adjustment on both ends to make it work for the communication needed. And by that standard our versions of English (or, at least, yours) are superb. The only language rules that count in this community are the rules of R
syntax.
I agree with @nirgrahamuk's assessment that many of the variables may show perfect collinearity, especially when they are perfectly correlated or composed of NAs
.
But there are two other problems.
The first is that covariates \gg Aggregate.progression.
Applied Logistic Regression 3rd Edition by David W. Hosmer Jr., Stanley Lemeshow and Rodney X. Sturdivant (2009) has an excellent treatment of covariate selection for this case in Chapter 4: the three strategies of purposeful, step-wise forwards or stepwise backwards. stepwise forward and best subset. In all approaches, however, they caution against overfitting, which leads to numerically unstable test results.
It seems in your case that 6,000 covariates may reflect a generous vocabulary. From the goals of the analysis are all of them required? To take a trivial example, the common stopwords
in a text corpus are routinely discarded in natural language processing because their frequency overweights their scant semantic load. As the text authors note, the inclusion standard for any covariate is does its inclusion provide more information (as measured by a metric such as goodness of fit) if it is included?
Even in the reduced case of a 43 x 77 set, the noise produced is deafening in its crying out for feature reduction. Unless there is an a priori domain principle for selecting candidate variables, the tradeoffs are too complex.
What seems promising to be, in principle (says the man who is not embarking upon it himself) is subset selection by bootstrap sampling of the candidate covariates a handful at a time, applying a goodness of fit test such as the Hosmer-Lemeshaw found in ResourceSelection::hoslem.test
after first filtering by log likelihood. I've done subset testing on 10,000x20 data on an underpowered machine using that method, so subsets of 20 covariates is feasible. The combination of 77 covariates taken 15 at a time, say, combn[77,15] \approx 3.527931e+15 , is too many to exhaust without an array of very large instance cloud resources operating in parallel (my guess). However, putting the survivors into a single- or double- elimination process may yield a useful selection.
Quick caveat: it's conceivable, but I don't know, that something akin to Bayesian autocorrelation may rear its head due to co-occurrences of n-grams in the text, words that are routinely followed or preceded by specific other words.
The second problem that I have in mind is data structure. It might be better with the lexical tokens as rows and the number of occurrences as the sole variable for the purpose of creating Bayesian priors and attacking from that flank Rank speculation.
Finally, a question. I'm certain that NLP packages among them address this class of problem. Are you open to searching among them for that functionality with me?