Extracting and displaying German Umlaute in RStudio

Hi folks,

I'm trying to extract German words from Facebook comments. Unfortunately, the umlauts (ä, ö, ü) are not coming through. With iconv(enc2utf8(x), sub="byte") I get nothing in their place (gaps), and with iconv(x, "UTF-8", "ASCII//TRANSLIT") I get plain a, o, u. Both lead to wrong results and to further issues when removing stop words.
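To make the two variants easier to compare, here is a minimal sketch on a made-up sample string (the string and the variable names are my own assumptions, not taken from the real comments). I have left out the expected output because the transliteration result depends on the platform's iconv implementation.

```r
# Hypothetical sample; "\u00fc" is the escape for ü, so the string is marked as UTF-8.
x <- "M\u00fcller f\u00e4hrt \u00f6fter"

# Variant 1: transliterate to ASCII. The umlauts are replaced or dropped;
# the exact result depends on the platform's iconv implementation.
translit <- iconv(x, "UTF-8", "ASCII//TRANSLIT")

# Variant 2: 'from' and 'to' default to the native encoding, so in a UTF-8
# locale the text should pass through unchanged; bytes that cannot be
# converted would be shown as <xx>.
bytes <- iconv(enc2utf8(x), sub = "byte")

print(translit)
print(bytes)
```

If the second call already prints the umlauts correctly on the sample, the loss probably happens in a later step rather than in iconv itself.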

It might be related to the encoding settings. The default encoding is set to UTF-8, and my locale is:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
I've read in various posts that people use German_Germany.1252 as their locale, but I cannot set it as the default in R because R rejects the call.
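For reference, this is roughly how the locale can be inspected and how the rejected call looks. It is only a sketch: as far as I understand, German_Germany.1252 is a Windows-style locale name, so I assume it simply does not exist on systems that report names like de_DE.UTF-8.

```r
# Show the locale R is currently running under.
current <- Sys.getlocale()
print(current)

# Try the Windows-style name. On failure Sys.setlocale() returns ""
# and issues a warning, which is silenced here.
result <- suppressWarnings(Sys.setlocale("LC_CTYPE", "German_Germany.1252"))
if (identical(result, "")) {
  message("Locale 'German_Germany.1252' is not available on this system.")
}
```

Only LC_CTYPE is touched here, so a failed attempt leaves the rest of the session's locale untouched.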

I'm using the packages RCurl, tm, rjson and stringi. Here's my code so far:


url <-  "https://graph.facebook.com/v3.2/795931377273410/comments?limit=999&access_token=EAAHmLBC6OnsBAEbHfGfHi3iEBFmKZCEQUY0Pf6d3y5A7VbxsZBl4nk61UuZCLXwB14tS9uwmIwQZBEh6cG7KDHoePxJ9SHPDZCBLzrXKPjSbZB1t5TZCWqTZARcXCBkjZAZBePMooCa459M1uN8BrK26ttottyRd8QZBG5cE9ZCk1bEDhke8OrfnBLudg6ZCHzAmv4WoZD"

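library(RCurl)    # getURL()
library(rjson)    # fromJSON()
library(tm)       # Corpus(), tm_map(), TermDocumentMatrix()
library(stringi)  # stri_trans_tolower()

# Download the raw JSON response
d <- getURL(url)

# Parse the JSON string into an R list
j <- fromJSON(d)

# Pull the message text out of each comment
comments <- sapply(j$data, function(item) list(comment = item$message))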

Cleanedcomments <- sapply(comments, function(x) iconv(enc2utf8(x), sub="byte"))

my_corpus <- Corpus(VectorSource(Cleanedcomments))
# Remove every match of `pattern` from the text
my_function <- content_transformer(function(x, pattern) gsub(pattern, "", x))

my_corpus <- tm_map(my_corpus, my_function, "/")
my_corpus <- tm_map(my_corpus, my_function, "@")
my_corpus <- tm_map(my_corpus, my_function, "\\|")
my_corpus <- tm_map(my_corpus, content_transformer(stri_trans_tolower))
my_corpus <- tm_map(my_corpus, removeNumbers)
my_corpus <- tm_map(my_corpus, removeWords, c(stopwords("german")))
my_corpus <- tm_map(my_corpus, removePunctuation)
my_corpus <- tm_map(my_corpus, stripWhitespace)

my_tdm <- TermDocumentMatrix(my_corpus)
m <- as.matrix(my_tdm)
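
To narrow down where the umlauts get lost, a check like the following could be run on the text after each step. This is only a sketch in base R: sample_comments is a made-up stand-in for the real comments, and has_umlaut is a name I picked for illustration.

```r
# Hypothetical stand-in for the comments returned by the Graph API.
sample_comments <- c("Sch\u00f6ne Gr\u00fc\u00dfe", "Das ist gut")

# Declared encoding of each element ("unknown" is normal for pure ASCII).
print(Encoding(sample_comments))

# TRUE where the bytes form valid UTF-8.
print(validUTF8(sample_comments))

# TRUE where at least one of ä, ö, ü is still present.
has_umlaut <- grepl("[\u00e4\u00f6\u00fc]", sample_comments)
print(has_umlaut)
```

Running the same three checks on d, comments and Cleanedcomments should show the first step at which the umlauts disappear.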

If anyone has got an idea how to deal with the issue, I'd be glad to hear it.

Kind Regards
