Ask R to differentiate nouns, verbs and adjectives

Is there any way to give R a text in a .txt file and ask it to count the 5 most used adjectives, the 5 most used nouns, and the 5 most used verbs?

You can try the tidytext package, specifically its parts_of_speech dataset, which maps English words to their parts of speech.

https://rdrr.io/cran/tidytext/man/parts_of_speech.html
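
For example, here is a minimal sketch of how those pieces might fit together; the file name text.txt is a placeholder, and note that parts_of_speech can assign more than one label to a word, with verbs split into categories such as "Verb (transitive)":

```r
library(dplyr)
library(tidytext)

tibble(text = readLines("text.txt")) |>
  unnest_tokens(word, text) |>                  # lowercase words, punctuation stripped
  anti_join(stop_words, by = "word") |>         # drop "a", "an", "the", ...
  inner_join(parts_of_speech, by = "word") |>   # attach part-of-speech labels
  mutate(pos = ifelse(grepl("^Verb", pos), "Verb", pos)) |>  # collapse verb subtypes
  filter(pos %in% c("Noun", "Verb", "Adjective")) |>
  count(pos, word, sort = TRUE) |>
  group_by(pos) |>
  slice_max(n, n = 5) |>                        # top 5 within each category
  ungroup()
```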

As @scottyd22 suggests, all the tools are in {tidytext}. Given free-form text, the process is pretty straightforward:

  1. Read the text into a tibble (this collection of text is called a "corpus").
  2. Tokenize (convert everything to lowercase, strip punctuation).
  3. Remove the "stopwords" ("a", "an", "is", "that", etc.).
  4. Get a word frequency list (steps 1-4 are sketched below).
  5. Scan the top 15 and classify the word types by eye.
  6. There will probably be more nouns than adjectives or verbs, so take the next 15 as needed; once you have your five nouns, start ignoring further nouns.
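
A minimal sketch of steps 1 through 4, assuming a plain-text file text.txt (a placeholder name):

```r
library(dplyr)
library(tidytext)

freq <- tibble(text = readLines("text.txt")) |>  # 1. read into a tibble
  unnest_tokens(word, text) |>                   # 2. tokenize (lowercase, no punctuation)
  anti_join(stop_words, by = "word") |>          # 3. remove the stopwords
  count(word, sort = TRUE)                       # 4. word frequency list

print(freq, n = 15)                              # 5. scan the top 15 by eye
```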

It's also possible to do full parts-of-speech recognition.

Thanks! Do you know if there is a similar package that can do the same in Spanish?

You could try it with Spanish stop words:

Spanish Stopwords for tidytext package | Swimming the Data Lake (rbind.io)
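
A sketch of how that could look, using the {stopwords} package as one convenient source of a Spanish list (the linked post builds its own); texto.txt is a placeholder file name:

```r
library(dplyr)
library(tidytext)
library(stopwords)

# Spanish stopword list: "de", "la", "que", "el", ...
stops_es <- tibble(word = stopwords("es"))

tibble(text = readLines("texto.txt")) |>
  unnest_tokens(word, text) |>
  anti_join(stops_es, by = "word") |>
  count(word, sort = TRUE)
```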

Hi, thanks, but I don't need the stopwords in Spanish; what I need is a way to differentiate verbs, nouns, and adjectives in Spanish.

The following tidytext issue, which asks about using the package with Spanish, has a link to a Spanish lexicon that could be utilised.

Hi, thanks, but that issue is about a sentiment dictionary of words; it's not what I need.

Sorry the info provided was not helpful.
Perhaps something from https://universaldependencies.org/ would be helpful, as it is a project collecting parts-of-speech data for many languages.
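
In R, one route to those models is the {udpipe} package, which wraps pretrained Universal Dependencies models (including Spanish) and tags each token with a universal part of speech such as NOUN, VERB, or ADJ. A sketch, with texto.txt as a placeholder file name:

```r
library(dplyr)
library(udpipe)

texto <- paste(readLines("texto.txt"), collapse = "\n")

modelo <- udpipe_download_model(language = "spanish")  # downloads a pretrained UD model
ud     <- udpipe_load_model(modelo$file_model)

anotado <- as.data.frame(udpipe_annotate(ud, x = texto))

anotado |>
  filter(upos %in% c("NOUN", "VERB", "ADJ")) |>  # universal POS tags
  count(upos, lemma, sort = TRUE) |>             # counting lemmas groups "es"/"son" under "ser"
  group_by(upos) |>
  slice_max(n, n = 5) |>
  ungroup()
```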

Those NLP POS (natural language processing parts of speech) tools exist for many languages, including Spanish. You can use a Spanish stop words list to get rid of "a, el, la, le, este, ...", none of which are of interest and all of which get in the way. Once you've trimmed those, if you are only interested in the Top 5s and not in classifying the entire document into parts of speech, you need nothing beyond the {base} R tools and basic Spanish literacy to scan the top results and pick out the five most frequent words of each type. Because word frequencies follow Zipf's law, a relatively small number of words will account for a relatively large part of the text. Just a seat-of-the-pants guess, but I'd be very surprised if it were necessary to scan more than a few dozen of the top words to get the Top 5s.
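
A base-R sketch of that scan; the file name and the starter stopword vector are placeholders to extend as needed:

```r
# Read, lowercase, and split on anything that is not a letter
# (in a UTF-8 locale, [:alpha:] also keeps accented letters).
texto    <- tolower(readLines("texto.txt"))
palabras <- unlist(strsplit(texto, "[^[:alpha:]]+"))
palabras <- palabras[nzchar(palabras)]

# Starter Spanish stopword vector; extend as needed.
stops    <- c("a", "el", "la", "le", "este", "de", "que", "y", "en", "los", "las")
palabras <- palabras[!palabras %in% stops]

# Zipf's law: a few dozen top words cover most of the text.
head(sort(table(palabras), decreasing = TRUE), 30)
```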
