However, the problem with that approach is that it doesn't allow you, for instance, to:
a) prune the corpus of its most frequent words (with quanteda, for instance; see the sketch after this list), or
b) exclude stopwords, or
c) perform any other standard NLP preprocessing on the set.
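For reference, this is the kind of preprocessing I mean — a minimal quanteda sketch, assuming a character vector `txt` of documents (the variable names are just for illustration):

```r
library(quanteda)

# tokenize and drop stopwords -- the step texts_to_sequences() skips
toks <- tokens(txt, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))

# prune terms appearing in more than 90% of documents
d <- dfm(toks)
d <- dfm_trim(d, max_docfreq = 0.9, docfreq_type = "prop")
```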
Can anyone recommend another tokenization approach whose output is compatible with the input requirements of keras models (i.e. what texts_to_sequences() produces)?
I've had some very good results with the udpipe package, especially when using languages other than English (Czech, my native language, belongs to the West Slavic family).
The function is udpipe::udpipe_annotate(). I have used its output as input for vocabulary-based LSTM keras models, with positive results; classification accuracy improved when I started using lemmas instead of plain tokens.
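To illustrate what the annotation gives you — a minimal sketch; the exact Czech model file depends on what udpipe_download_model() fetches:

```r
library(udpipe)

m  <- udpipe_download_model(language = "czech")  # one-time download
ud <- udpipe_load_model(m$file_model)

anno <- as.data.frame(udpipe_annotate(ud, x = "Koupili jsme nové auto."))
anno[, c("token", "lemma", "upos")]
# the lemma column collapses inflected forms, e.g. "Koupili" -> "koupit"
```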
Thanks for pointing that out! Could you provide a minimal example of formatting the output of udpipe::udpipe_annotate() into the format expected by keras, where each word in the sequence is referenced by its index in the dictionary?
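Sure — here's a minimal sketch of one way to do it. I'm assuming a character vector `texts` and a loaded model `ud` as in the previous snippet; the POS filter and the `num_words`/`maxlen` values are just for illustration:

```r
library(udpipe)
library(keras)

# annotate and flatten to a data frame (one row per token)
anno <- as.data.frame(udpipe_annotate(ud, x = texts))

# this is where udpipe pays off: filter on the annotations before indexing,
# e.g. drop punctuation and common function words via the universal POS tags
anno <- subset(anno, !upos %in% c("PUNCT", "DET", "ADP", "CCONJ", "SCONJ"))

# rebuild one lemma-based pseudo-document per original text
docs <- aggregate(lemma ~ doc_id, data = anno, FUN = paste, collapse = " ")

# index the vocabulary with keras' own tokenizer, so the result is
# exactly the integer-sequence format texts_to_sequences() produces
tok <- text_tokenizer(num_words = 10000)
fit_text_tokenizer(tok, docs$lemma)
seqs <- texts_to_sequences(tok, docs$lemma)

# pad to a fixed length for the embedding/LSTM input layer
x <- pad_sequences(seqs, maxlen = 100)
```

The dictionary itself is available as tok$word_index, so each integer in seqs maps back to a lemma. One caveat: aggregate() returns rows ordered by doc_id as a character vector, so make sure your labels are joined on doc_id rather than assumed to be in the original order.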
Thanks, that's actually great! Let me come back to you on this later in the week, when I find the time to review it in greater detail. Haha, I'll give you a shout when I'm in Prague!
Hi @jlacko - over on the quanteda GitHub page we're discussing the option of converting its tokens and dfm objects directly into a keras-compatible object via quanteda's convert() function: link. I believe your input could be valuable.