I am running into an issue with character encoding while doing text mining with the tidyverse. I am working with an Italian dataset, and after tokenizing my data I am noticing that some characters are not coming through properly. For example, a word like "un'altra" will sometimes end up as "un&lt;U+0092&gt;altra" (this issue with the apostrophe is not even consistent), or a word will end up cut short, as in "communit".
I have tried to fix this in the tokenized data set by converting to UTF-8 with the utf8 package and to Latin-1 with stringi, but with no success, even though the reported encoding does change.
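For reference, the conversions I tried looked roughly like this (`tokens$word` is a placeholder for the actual token column in my data):

```r
library(utf8)
library(stringi)

# Placeholder column name; mine differs slightly
tokens$word <- as_utf8(tokens$word)                     # via the utf8 package
tokens$word <- stri_encode(tokens$word, to = "latin1")  # via stringi

Encoding(tokens$word)  # the declared encoding changes, but the glyphs do not
```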
Is there a solution to this, either in the way the data is tokenized or in re-encoding the tokenized data?
What tokenizer are you using? Most text mining tools out there are optimized for English, so non-ASCII characters, complicated inflections, etc. tend to cause some degree of pain.
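Before switching tools, it may be worth checking the encoding of the source file itself: &lt;U+0092&gt; is exactly what you get when a Windows-1252 curly apostrophe (byte 0x92) is decoded under the wrong encoding. A sketch, assuming a CSV input at the placeholder path "testo.csv":

```r
library(readr)

# Let readr guess the encoding from the raw bytes
guess_encoding("testo.csv")

# If it reports windows-1252 (or ISO-8859-1), re-read with that locale so
# apostrophes and accents decode correctly *before* tokenizing
testo <- read_csv("testo.csv",
                  locale = locale(encoding = "windows-1252"))
```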
I had good results with the udpipe package; its lemmatization was particularly helpful.
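A minimal sketch of what that looks like for Italian (the sample sentence is just an illustration; `udpipe()` downloads the Italian model on first use):

```r
library(udpipe)

# Tokenize, tag, and lemmatize an Italian sentence; the pretrained model
# is downloaded to the working directory the first time this runs
ann <- udpipe(x = c(doc1 = "Ho trovato un'altra comunità."),
              object = "italian")

# One row per token, with the lemma alongside the surface form
ann[, c("doc_id", "token", "lemma", "upos")]
```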