I'm not sure why people don't like being asked to provide a reproducible example. A question which opens with a concrete setup, the way yours does, is not very abstract (and that's actually a good thing! See below). However, expecting potential answerers to wade through a multi-page link (Analyzing Texts with the text2vec package) and then guess what exactly you tried and where you got stuck is far less likely to get you any kind of answer than showing us actual code which we can reproduce, and explaining point by point which part worked and which didn't.
I don't think your question is very clear, but if I interpreted you correctly, you're asking "what's the general R solution to the age-old problem of Out-Of-Vocabulary (OOV) words in NLP?". This is a bit like asking "What can I do to reduce the generalization gap of a neural network?": very abstract, and thus very hard (maybe even impossible) to answer. There isn't a single solution to OOV in NLP which works for all kinds of NLP tasks (reading comprehension, natural language inference, machine translation, named entity recognition, constituency parsing, language modeling, sentiment analysis, skip-thoughts, autoencoding, etc.). But there could be a good answer to a more specific question about how to deal with OOV for a particular task on a specific data set, in the same way that the very specific question "What can I do to reduce the generalization gap of architecture so-and-so, on train/test set so-and-so, with loss function so-and-so, after having tried X and Y..." is much more likely to get an answer than its more abstract cousin.
A sane approach
Anyway, I'll make an attempt at answering your generic question. If the task you're trying to solve can benefit from word embeddings (e.g., language modeling), then the industry-standard solution is to just download Facebook's fastText pretrained word embeddings. You can find them here: https://fasttext.cc/
At 1 million words for the English model trained on Wikipedia, or 2 million for the one trained on Common Crawl, I dare you to find any OOV words at test time.
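If you go down this road, loading the pretrained vectors into R is mostly a matter of parsing a big text file. Here's a minimal sketch, assuming you've downloaded and unzipped one of the `.vec` files (e.g. crawl-300d-2M.vec) and that it fits in your RAM, which is not a given at ~2 million rows:

```r
# Sketch only: read a fastText .vec file (first line is "<n_words> <dim>",
# then one word followed by 300 numbers per line) into a word-by-dimension matrix.
library(data.table)

vecs <- fread("crawl-300d-2M.vec", skip = 1, header = FALSE,
              quote = "", encoding = "UTF-8", data.table = FALSE)

embeddings <- as.matrix(vecs[, -1])
rownames(embeddings) <- vecs[[1]]

embeddings["language", 1:5]   # first 5 dimensions of the vector for "language"
```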
Of course, if you are inventing words on the spot (e.g., gearshift), then you won't find them in a pretrained `fasttext` model. However, you can still use `fasttext` to infer a word embedding even for an OOV word: in that case you will need one of the models trained with subword information, such as this one. The fastText documentation has examples of how to do that in practice, and there's a paper you can read if you're interested in the theory behind the approach.
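For the command-line route, fastText's `print-word-vectors` mode reads words from stdin and composes a vector from character n-grams, so it works even for words the model has never seen. A hedged sketch, assuming the `fasttext` binary is on your PATH and you've downloaded a subword-aware model such as cc.en.300.bin:

```r
# Sketch: query a subword-aware fastText model for (possibly OOV) words from R
# by shelling out to the fasttext binary.
oov_words <- c("gearshiftiness", "unfindableword")   # made-up words, for illustration

out <- system2("fasttext",
               args   = c("print-word-vectors", "cc.en.300.bin"),
               input  = oov_words,
               stdout = TRUE)

# Each output line is "<word> <v1> <v2> ... <v300>"; parse into a numeric matrix.
parse_line <- function(line) as.numeric(strsplit(trimws(line), " ")[[1]][-1])
vecs <- do.call(rbind, lapply(out, parse_line))
rownames(vecs) <- oov_words
dim(vecs)   # 2 x 300
```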
Note that `fasttext` is not a Python module (though a Python wrapper exists: see below); rather, it's a library which you install and build under Linux or OS X. You can easily use it from a bash script, but if you're scared of scripts you'll have to use `reticulate` and install the `gensim` Python library, since `gensim` includes a Python wrapper for `fasttext`. I think this may be more complicated and error-prone than just running `fasttext` as a system command, but if you really want to risk your sanity, here are two SO posts to help you:
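For completeness, this is roughly what the `reticulate` + `gensim` route looks like. It's only a sketch, assuming your Python environment has a reasonably recent gensim (the loader has been renamed across versions; newer releases use `load_facebook_model()`), and it's not something I've battle-tested from R:

```r
# Sketch: use gensim's fastText wrapper from R via reticulate.
library(reticulate)

gensim <- import("gensim")

# Load a model trained with subword information (a .bin file, not just a .vec file)
ft <- gensim$models$fasttext$load_facebook_model("cc.en.300.bin")

# Because vectors are composed from character n-grams, even an OOV word
# gets an embedding instead of a KeyError.
v <- ft$wv$get_vector("gearshiftiness")
length(v)   # 300
```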
Sadly, I don't think there are R packages with the same OOV handling capabilities as `gensim`, but maybe you could try `quanteda`: Ken Benoit is a nice bloke, and if you ask a question on SO with the tags `text-mining` and `quanteda`, he may answer it himself (you may even drop him a line, if it's just to ask about `quanteda`'s OOV capabilities). One thing you'll like for sure about `quanteda` is how fast it is! Another guy who may be willing to help is Sebastian Ruder: AFAIK he's mostly a Python user, but his knowledge of modern NLP is so vast that he may still be able to help.
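I can't promise `quanteda` will infer embeddings for unseen words the way `fasttext` does, but it does give you pragmatic tools for vocabulary mismatch. For instance (a sketch, assuming a current quanteda version), `dfm_match()` aligns a test-set document-feature matrix to a training vocabulary:

```r
# Sketch: handle train/test vocabulary mismatch at the document-feature-matrix level.
library(quanteda)

train_dfm <- dfm(tokens(c("the cat sat on the mat", "dogs chase cats")))
test_dfm  <- dfm(tokens("the gearshift chased the cat"))

# Align the test dfm to the training features: OOV features ("gearshift", "chased")
# are dropped, and training features absent from the test text become zero columns.
test_aligned <- dfm_match(test_dfm, features = featnames(train_dfm))
identical(featnames(test_aligned), featnames(train_dfm))   # TRUE
```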
An insane approach
If even `fasttext`'s OOV capabilities are not good enough for you, you may try to implement your own deep learning model in `keras` or `tensorflow` (both available in R) to:
- learn word embeddings on the fly (a minimal `keras` sketch follows at the end of this list)
- mimic word embeddings using subword RNNs: note that there's actually code for MIMICK, but it will be useless to you because it's in Python and uses the NN library `dynet`; you may risk running it under R using `reticulate`, but I don't foresee that ending well
- implement Hybrid Word-Character Models to achieve open-vocabulary modeling: this was developed for the Neural Machine Translation task rather than for language modeling, and again there's code for it, but it's in Matlab, so you'd have to reimplement it in `keras` or TF
- (maximum insanity level) you could actually use BERT to learn embeddings for OOV words based on context. BERT is written in TensorFlow, thus you may be able to run it with the `tensorflow` R package, and since it's the most advanced language-understanding model available, I'm willing to bet that it won't have any issues with your OOV words. However, this is the NLP equivalent of killing a fly with the Death Star!
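To give a flavour of the first item, here is a minimal `keras` (R) sketch of learning embeddings on the fly as part of a classifier. The vocabulary size, sequence length and the convention of mapping every OOV word to a shared "unknown" index are my assumptions for illustration, not something prescribed by your setup:

```r
# Sketch: an embedding layer learned jointly with a simple binary classifier.
# Inputs are assumed to be integer word indices in [1, vocab_size], padded to
# max_len, with all OOV words mapped to one shared "<unk>" index beforehand.
library(keras)

vocab_size <- 20000   # hypothetical vocabulary size (including the "<unk>" slot)
max_len    <- 100     # hypothetical padded sequence length

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = vocab_size + 1, output_dim = 128,
                  input_length = max_len) %>%
  layer_lstm(units = 64) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(optimizer = "adam",
                  loss = "binary_crossentropy",
                  metrics = "accuracy")

summary(model)
```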