Removing Spanish stopwords from text

MElgner · October 4, 2022, 6:42pm

I'm a complete newbie in RStudio and am working on an academic project in Spanish.

Our plan has multiple steps:

1. Insert text into R
2. Separate it into words
3. Remove stopwords so there are only names of authors etc.
4. Show distances between these names

Our text is more than 1000 pages long.

Afterwards we want to compare this list of names with another list of names, but this doesn't have to be done in RStudio.
Greetings!

technocrat · October 4, 2022, 8:03pm

This is a Natural Language Processing problem, specifically named entity recognition (NER). For the first task, tokenization, the {tidytext} package provides the necessary tools.
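A minimal sketch of the tokenization step, assuming {dplyr} and {tidytext} are installed. The two sample sentences and the object names `texto` and `palabras` are made up for illustration; the real text would be read from a file instead.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line sample standing in for the real text
texto <- tibble(
  linea = 1:2,
  text  = c("Sheila Blanco canta poemas de Federico García Lorca.",
            "La obra de Gloria Fuertes es muy conocida en España.")
)

# One row per word; punctuation is dropped.
# to_lower = FALSE keeps capital letters, which are a useful hint for names.
palabras <- texto %>%
  unnest_tokens(word, text, to_lower = FALSE)

print(palabras)
```

The `linea` column is carried along, so each word still records which line it came from.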

Using stopwords to isolate names of persons is not the preferred method. For example, it could reduce Sheila Blanco to just Sheila. Stopwords are used to discard the common parts of speech, such as prepositions and conjunctions that have high frequency but low information content.
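To see what stopword removal actually does, here is a small sketch. It assumes the {stopwords} package is installed alongside {tidytext}, since `get_stopwords()` relies on it. The sample sentence and the object names are only illustrative.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-sentence sample
texto <- tibble(text = "Sheila Blanco canta poemas de Federico García Lorca.")

# Tokenize (lower-cased by default, which matches the stopword list)
palabras <- texto %>%
  unnest_tokens(word, text)

# Spanish stopword list (snowball source by default)
vacias <- get_stopwords(language = "es")

# Keep only the words that are NOT on the stopword list
sin_vacias <- palabras %>%
  anti_join(vacias, by = "word")

print(sin_vacias)
```

Only function words such as "de" are dropped. Ordinary content words like "canta" and "poemas" remain next to the names, so the result is not a list of names.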

The usual way is to match the text against a separate corpus of names of persons, institutions or places. The {nametag} package does this for four languages, but not for Spanish. However, several Spanish resources are listed at the end of a GitHub repo (NER).
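As a rough sketch of the lookup idea (this is not how {nametag} works internally), the text can be split into two-word sequences and matched against a reference list. The `nombres` table here is a hypothetical stand-in for a real names corpus, and the sample sentence is made up.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-sentence sample
texto <- tibble(text = "Sheila Blanco canta poemas de Federico García Lorca.")

# Hypothetical reference list; in practice this would come from a names corpus
nombres <- tibble(bigrama = c("Sheila Blanco", "Gloria Fuertes"))

# Split the text into overlapping two-word sequences, keeping capitalization
bigramas <- texto %>%
  unnest_tokens(bigrama, text, token = "ngrams", n = 2, to_lower = FALSE)

# Keep only the sequences that appear in the reference list
encontrados <- bigramas %>%
  semi_join(nombres, by = "bigrama")

print(encontrados)
```

Names with three words, such as "Federico García Lorca", would need `n = 3` and a matching entry in the list, so a proper NER tool is likely the better route for 1000 pages.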

MElgner · October 5, 2022, 6:52am

Thanks a lot! I'll try it and maybe return with more questions^^
