Ask R to differentiate nouns, verbs and adjectives

Is there any way to give R a text in a .txt file and ask it to count the 5 most used adjectives, the 5 most used nouns, and the 5 most used verbs?

You can try the tidytext package, specifically its parts_of_speech dataset, which maps English words to their parts of speech.

https://rdrr.io/cran/tidytext/man/parts_of_speech.html
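
For example, here is a minimal sketch of how those pieces might fit together; the file name text.txt is a placeholder, and note that parts_of_speech can assign more than one label to a word, with verbs split into categories such as "Verb (transitive)":

```r
library(dplyr)
library(tidytext)

tibble(text = readLines("text.txt")) |>
  unnest_tokens(word, text) |>                  # lowercase words, punctuation stripped
  anti_join(stop_words, by = "word") |>         # drop "a", "an", "the", ...
  inner_join(parts_of_speech, by = "word") |>   # attach part-of-speech labels
  mutate(pos = ifelse(grepl("^Verb", pos), "Verb", pos)) |>  # collapse verb subtypes
  filter(pos %in% c("Noun", "Verb", "Adjective")) |>
  count(pos, word, sort = TRUE) |>
  group_by(pos) |>
  slice_max(n, n = 5) |>                        # top 5 within each category
  ungroup()
```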

As @scottyd22 suggests, all the tools are in {tidytext}. Given free-form text, the process is pretty straightforward:

  1. Read the text into a tibble (this collection of text is called a "corpus").
  2. Tokenize (convert everything to lowercase, strip punctuation).
  3. Remove the "stopwords" ("a", "an", "is", "that", etc.).
  4. Get a word frequency list (steps 1-4 are sketched below).
  5. Scan the top 15 and classify the word types by eye.
  6. There will probably be more nouns than adjectives or verbs, so take the next 15 as needed; once you have your five nouns, start ignoring further nouns.
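
A minimal sketch of steps 1 through 4, assuming a plain-text file text.txt (a placeholder name):

```r
library(dplyr)
library(tidytext)

freq <- tibble(text = readLines("text.txt")) |>  # 1. read into a tibble
  unnest_tokens(word, text) |>                   # 2. tokenize (lowercase, no punctuation)
  anti_join(stop_words, by = "word") |>          # 3. remove the stopwords
  count(word, sort = TRUE)                       # 4. word frequency list

print(freq, n = 15)                              # 5. scan the top 15 by eye
```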

It's also possible to do full parts-of-speech recognition.

Thanks! Do you know if there is a similar package that can do the same in Spanish?

You could try it with Spanish stop words:

Spanish Stopwords for tidytext package | Swimming the Data Lake (rbind.io)
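
A sketch of how that could look, using the {stopwords} package as one convenient source of a Spanish list (the linked post builds its own); texto.txt is a placeholder file name:

```r
library(dplyr)
library(tidytext)
library(stopwords)

# Spanish stopword list: "de", "la", "que", "el", ...
stops_es <- tibble(word = stopwords("es"))

tibble(text = readLines("texto.txt")) |>
  unnest_tokens(word, text) |>
  anti_join(stops_es, by = "word") |>
  count(word, sort = TRUE)
```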

Hi, thanks, but I don't need the stopwords in Spanish; what I need is a way to differentiate verbs, nouns, and adjectives in Spanish.

The following tidytext issue, which asks about using the package with Spanish, has a link to a Spanish lexicon that could be utilised.

Hi, thanks, but that issue is about a sentiment dictionary of words; it's not what I need.

Sorry the info provided was not helpful.
Perhaps something from https://universaldependencies.org/ would be helpful, as it is a project collecting parts-of-speech data for many languages.
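
In R, one route to those models is the {udpipe} package, which wraps pretrained Universal Dependencies models (including Spanish) and tags each token with a universal part of speech such as NOUN, VERB, or ADJ. A sketch, with texto.txt as a placeholder file name:

```r
library(dplyr)
library(udpipe)

texto <- paste(readLines("texto.txt"), collapse = "\n")

modelo <- udpipe_download_model(language = "spanish")  # downloads a pretrained UD model
ud     <- udpipe_load_model(modelo$file_model)

anotado <- as.data.frame(udpipe_annotate(ud, x = texto))

anotado |>
  filter(upos %in% c("NOUN", "VERB", "ADJ")) |>  # universal POS tags
  count(upos, lemma, sort = TRUE) |>             # counting lemmas groups "es"/"son" under "ser"
  group_by(upos) |>
  slice_max(n, n = 5) |>
  ungroup()
```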

Those NLP POS (natural language processing parts of speech) tools exist for many languages, including Spanish. You can use a Spanish stop words list to get rid of "a, el, la, le, este, ...", none of which are of interest and all of which get in the way. Once you've trimmed those, if you are only interested in the Top 5s and not in classifying the entire document into parts of speech, you need nothing beyond the {base} R tools and basic Spanish literacy to scan the top results and pick out the five most frequent words of each type. Because word frequencies follow Zipf's law, a relatively small number of words will account for a relatively large part of the text. Just a seat-of-the-pants guess, but I'd be very surprised if it were necessary to scan more than a few dozen of the top words to get the Top 5s.
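
A base-R sketch of that scan; the file name and the starter stopword vector are placeholders to extend as needed:

```r
# Read, lowercase, and split on anything that is not a letter
# (in a UTF-8 locale, [:alpha:] also keeps accented letters).
texto    <- tolower(readLines("texto.txt"))
palabras <- unlist(strsplit(texto, "[^[:alpha:]]+"))
palabras <- palabras[nzchar(palabras)]

# Starter Spanish stopword vector; extend as needed.
stops    <- c("a", "el", "la", "le", "este", "de", "que", "y", "en", "los", "las")
palabras <- palabras[!palabras %in% stops]

# Zipf's law: a few dozen top words cover most of the text.
head(sort(table(palabras), decreasing = TRUE), 30)
```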
