I am running into an issue with character encoding while doing text mining with the tidyverse. I am working with an Italian dataset, and after tokenizing my data I am noticing that some characters are not coming through properly. For example, a word like "un'altra" will sometimes end up as "un&lt;U+0092&gt;altra" (this issue with the apostrophe is not even consistent), or a word will end up cut short, as in "communit".
I have tried to fix this in the tokenized data set by converting to UTF-8 with the utf8 package and to Latin-1 with stringi, but with no success, even though the reported encoding does change.
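For reference, the conversions I tried looked roughly like this (`tokens$word` is a placeholder for the actual token column in my data):

```r
library(utf8)
library(stringi)

# Placeholder column name; mine differs slightly
tokens$word <- as_utf8(tokens$word)                     # via the utf8 package
tokens$word <- stri_encode(tokens$word, to = "latin1")  # via stringi

Encoding(tokens$word)  # the declared encoding changes, but the glyphs do not
```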
Is there a solution to this, either in the way the data is tokenized or in re-encoding the tokenized data?
What tokenizer are you using? Most text mining tools out there are optimized for English, so non-ASCII characters, complicated inflections, etc. tend to cause some degree of pain.
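Before switching tools, it may be worth checking the encoding of the source file itself: &lt;U+0092&gt; is exactly what you get when a Windows-1252 curly apostrophe (byte 0x92) is decoded under the wrong encoding. A sketch, assuming a CSV input at the placeholder path "testo.csv":

```r
library(readr)

# Let readr guess the encoding from the raw bytes
guess_encoding("testo.csv")

# If it reports windows-1252 (or ISO-8859-1), re-read with that locale so
# apostrophes and accents decode correctly *before* tokenizing
testo <- read_csv("testo.csv",
                  locale = locale(encoding = "windows-1252"))
```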
I had good results with the udpipe package; its lemmatization was particularly helpful.
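A minimal sketch of what that looks like for Italian (the sample sentence is just an illustration; `udpipe()` downloads the Italian model on first use):

```r
library(udpipe)

# Tokenize, tag, and lemmatize an Italian sentence; the pretrained model
# is downloaded to the working directory the first time this runs
ann <- udpipe(x = c(doc1 = "Ho trovato un'altra comunità."),
              object = "italian")

# One row per token, with the lemma alongside the surface form
ann[, c("doc_id", "token", "lemma", "upos")]
```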