What is the best tokenizer to use with keras?

The standard keras tokenization workflow is the following:

# fit a tokenizer on the training texts, keeping the num_words most frequent words
tokenizer <- text_tokenizer(num_words = num_words) %>% 
  fit_text_tokenizer(df_train$text)

# convert each text into a sequence of integer word indices
sequences <- texts_to_sequences(tokenizer, df_train$text)

However, the problem with that approach is that it doesn't allow you, for instance, to:

a) prune the most frequent words from the corpus (with quanteda, for instance), or

b) exclude stopwords, or

c) perform any other standard NLP preprocessing on the set

Can anyone recommend another tokenization approach whose output is compatible with the input requirements of keras models (i.e. the format coming from texts_to_sequences())?
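
For illustration, this is roughly the kind of quanteda preprocessing I have in mind (a minimal sketch only; the stopword list and the 90% document-frequency cutoff are arbitrary placeholders):

library(quanteda)

# tokenize and drop punctuation
toks <- tokens(df_train$text, remove_punct = TRUE)

# b) exclude stopwords
toks <- tokens_remove(toks, stopwords("en"))

# a) prune the most frequent words: drop terms occurring in > 90% of documents
dfmat <- dfm_trim(dfm(toks), max_docfreq = 0.9, docfreq_type = "prop")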

I have had very good results with the udpipe package - especially when using languages other than English (Czech, my native language, belongs to the West Slavic family).

The function is udpipe::udpipe_annotate(). I have used its output as the input for vocabulary-based LSTM keras models, with positive results. Classification accuracy improved when I started using lemmas instead of plain tokens.
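
To give an idea of the annotation step, a minimal sketch might look like this (the English model is shown for illustration; models for Czech and dozens of other languages are also available):

library(udpipe)

# download & load a pre-trained UDPipe model
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

# annotate; the result converts to a data frame with token, lemma, upos, ... columns
anno <- as.data.frame(udpipe_annotate(ud_model, x = df_train$text))

# lemmas tend to work better than raw tokens for heavily inflected languages
head(anno[, c("doc_id", "token", "lemma")])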

Thanks for pointing that out! Could you provide a minimal example of formatting the output of udpipe::udpipe_annotate() into the format expected by keras, where each word in the sequence is referenced by its index in the dictionary?

I can do that, as it is an interesting problem and I have the code.

It will not easily fit the format of a forum post, as there is a lot the "minimal" example has to do:

  • tokenize a piece of text
  • build a vocabulary
  • build a matrix input
  • build & verify the model (ok, this part is optional, but it is the most fun)
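
Still, in very condensed form, the core of the vocabulary and sequence steps might look something like this (a rough sketch only, reusing the anno data frame from the annotation example above; num_words and the maxlen of 100 are placeholders):

library(keras)

# build a vocabulary: the num_words most frequent lemmas, indexed from 1
vocab <- head(sort(table(anno$lemma), decreasing = TRUE), num_words)
vocab_index <- setNames(seq_along(vocab), names(vocab))

# map each document's lemmas to vocabulary indices, dropping out-of-vocabulary lemmas
# (note: check that split() preserves the original document order for your doc_ids)
sequences <- lapply(split(anno$lemma, anno$doc_id), function(lemmas) {
  idx <- vocab_index[lemmas]
  unname(idx[!is.na(idx)])
})

# pad to a common length -- the same shape that pad_sequences(texts_to_sequences(...)) gives
x <- pad_sequences(sequences, maxlen = 100)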

The full, commented version I will make into a blog post on www.jla-data.net instead; that should not be a problem :slight_smile:

I wrote a blog post covering the subject, as it took a bit more space than is available in a forum post.
You owe me a beer :beer: :wink:

It is built on a toy scenario of classifying authorship of 1000 tweets by two popular accounts (tweets are a neat subject for text classification).

It achieves ~93% accuracy for starters, which is not bad and can still be improved upon.


Thanks, that's actually great! :slight_smile: Let me get back to you on this later in the week, when I find the time to review it in greater detail. Haha, I'll give you a shout when I'm in Prague :wink:

Thanks! It should give you a start. Do shout out if you run into trouble.

Hi @jlacko - over on the quanteda GitHub page we're discussing adding an option to convert its tokens and dfm objects directly into a keras-compatible object via quanteda's convert() function: link. I believe your input could be valuable :slight_smile:
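
Until something like that lands, a manual stand-in for the proposed conversion might look roughly like this (purely illustrative; there is currently no keras target in convert(), so the index mapping is done by hand against the dfm's feature set):

library(quanteda)

toks <- tokens(df_train$text, remove_punct = TRUE)
feats <- featnames(dfm(toks))

# map each token to its position in the dfm feature set, dropping unmatched tokens
sequences <- lapply(as.list(toks), function(tt) {
  idx <- match(tt, feats)
  idx[!is.na(idx)]
})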