How do I use stemming correctly?

Hello everyone!

I'm working on a project where I'm analyzing the reviews of a certain product.

I broke down the sentences to individual words, so now in my dataframe every row is an individual word.

When I tried to count the most common words in the reviews, I ran into the "problem" that the top 10 basically contains the same word twice. RStudio (rightfully) lists them as different words because they have different endings.

Here is what I mean. This is what I get when I run my code:

Review_words %>% count(word, sort = TRUE)

  1. use
  2. using
  3. good
  4. different
  5. differently

How do I chop the endings off the words, so that I can work with the "core words" such as "use" and "different"?

Thanks for your help in advance!


What you are looking for is lemmatisation, which reduces inflected forms of a word to a common dictionary form:

According to the NLP task view, the R package udpipe provides this functionality, and some other sites recommend textstem as well.
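
As a rough sketch of how this could look with the one-word-per-row data frame from the question: the snippet below adds a lemma column with textstem's lemmatize_words() and, for comparison, a stem column with SnowballC's wordStem(). The SnowballC part is my own addition and not something the sources above mention, and the small Review_words table is made-up stand-in data, so treat this as an illustration rather than a recipe.

```r
library(dplyr)      # tibble(), mutate(), count(), %>%
library(textstem)   # lemmatize_words()
library(SnowballC)  # wordStem()

# Made-up stand-in for the tokenised reviews: one word per row
Review_words <- tibble(
  word = c("use", "using", "using", "good", "different", "differently")
)

# Add a lemma (dictionary form) and a stem (suffix-stripped form) per word
Review_normalised <- Review_words %>%
  mutate(
    lemma = lemmatize_words(word),
    stem  = wordStem(word, language = "english")
  )

# Count on the normalised column instead of the raw word
stem_counts  <- Review_normalised %>% count(stem, sort = TRUE)
lemma_counts <- Review_normalised %>% count(lemma, sort = TRUE)

stem_counts
lemma_counts
```

Stemming is the more aggressive of the two: it strips suffixes by rule, so "different" and "differently" should end up under the same stem, although the stem itself need not be a real word. Lemmatisation returns proper dictionary forms, but it may keep an adverb like "differently" separate from "different". Which one fits better depends on how much merging you want in the counts.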

This thread gives a few examples:

Hope this helps.

Thank you very much!
