Text Mining - foreign language output issue

Elle · October 27, 2022, 7:06am

Hello! Yesterday I asked a question which someone very kindly answered for me. The problem is that their answer doesn't work for me and we don't understand why.

Problem: Take some foreign-language text and output a list of the words with a count of how often each one appears. Here is the R code.

library(tidyverse)

a <- tibble(
  text = "Привет, друзья! Меня зовут Макс и добро пожаловать на мой подкаст! 
   Да, наконец-то, наконец-то я запустил, я сделал свой подкаст! 
   Ухуууу! И я очень, очень, очень рад этому!"
)

a <- sapply(a, function(x) strsplit(x, split = " ")) %>%  
  unlist() %>% 
  tolower() %>% 
  as_tibble() %>% 
  mutate(value = str_replace_all(value, "[^[:alnum:]]", "")) %>% 
  count(value)

a

The person who made it gets a nice list of words in Russian in alphabetical order with a count. But when I run the code this is what I see in the console...

> library(tidyverse)
> 
> a <- tibble(
+   text = "Привет, друзья! Меня зовут Макс и добро пожаловать на мой подкаст! 
+    Да, наконец-то, наконец-то я запустил, я сделал свой подкаст! 
+    Ухуууу! И я очень, очень, очень рад этому!"
+ )
> 
> 
> 
> a <- sapply(a, function(x) strsplit(x, split = " ")) %>%  
+   unlist() %>% 
+   tolower() %>% 
+   as_tibble() %>% 
+   mutate(value = str_replace_all(value, "[^[:alnum:]]", "")) %>% 
+   count(value)
> 
> a
# A tibble: 22 x 2
   value          n
   <chr>      <int>
 1 ""             6
 2 "<U+0434><U+0430>"           1
 3 "<U+0434><U+043E><U+0431><U+0440><U+043E>"        1
 4 "<U+0434><U+0440><U+0443><U+0437><U+044C><U+044F>"       1
 5 "<U+0437><U+0430><U+043F><U+0443><U+0441><U+0442><U+0438><U+043B>"     1
 6 "<U+0437><U+043E><U+0432><U+0443><U+0442>"        1
 7 "<U+0438>"            2
 8 "<U+043C><U+0430><U+043A><U+0441>"         1
 9 "<U+043C><U+0435><U+043D><U+044F>"         1
10 "<U+043C><U+043E><U+0439>"          1
# ... with 12 more rows

I'm assuming it's something in my settings if the code works fine for them but not for me. Any ideas what it could be that I need to change in RStudio or maybe on my computer? Many thanks
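
A side note on the `""` row with a count of 6 in the output above: it looks like a by-product of splitting on a single space. Each line break in the text is followed by several spaces, and those pieces turn into empty strings once the punctuation is stripped. Below is a minimal base R sketch that splits on runs of whitespace and drops empty tokens instead. The names `words` and `word_counts` are only illustrative, and it assumes the session can handle the Cyrillic text (for example, a UTF-8 locale).

```r
# Same sample text as in the question
text <- "Привет, друзья! Меня зовут Макс и добро пожаловать на мой подкаст!
   Да, наконец-то, наконец-то я запустил, я сделал свой подкаст!
   Ухуууу! И я очень, очень, очень рад этому!"

# Split on runs of whitespace (spaces and newlines), not on single spaces
words <- unlist(strsplit(tolower(text), "[[:space:]]+"))

# Strip everything that is not a letter or digit, as in the original code
words <- gsub("[^[:alnum:]]", "", words)

# Drop any tokens that ended up empty
words <- words[nzchar(words)]

# Count each word; columns are named `value` and `n` to match the tibble above
word_counts <- as.data.frame(
  table(value = words),
  responseName = "n",
  stringsAsFactors = FALSE
)

# Most frequent words first
word_counts <- word_counts[order(-word_counts$n, word_counts$value), ]
print(word_counts, row.names = FALSE)
```

This keeps the behaviour of the original code otherwise, so a hyphenated word such as "наконец-то" is still collapsed into a single token without the hyphen.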

DavoWW · October 27, 2022, 7:39am

Hi @Elle,
Try changing the text encoding to UTF-8. Do this in RStudio via the "Tools > Global Options > Code > Saving > Serialization" menu.
HTH

Elle · October 27, 2022, 8:51am

Hi @DavoWW

Thanks for your suggestion. I've tried that (Changed from 'Ask' to 'UTF-8') and the code output is still the same. Any other ideas?

Update: Ah, it's me! In the bottom-left pane the output shows the weird codes, but in the viewer window it displays as Russian! Thanks for trying to help me, much appreciated.
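
That fits with the strings themselves being intact: `<U+0434>` is how R prints a character that the session's native encoding cannot represent, so only the display is affected. A small sketch of how one might confirm this, using Unicode escapes so it does not depend on how the script file is saved (the variable names are made up for illustration):

```r
# "да" written with Unicode escapes (U+0434, U+0430)
word <- "\u0434\u0430"

# The code points stored in the string
code_points <- utf8ToInt(word)

# Show them in the same U+XXXX form that the console output uses
print(sprintf("U+%04X", code_points))

# Two characters, regardless of how the console renders them
print(nchar(word))
```

If the code points match the ones shown in the escaped console output, the word counts are correct and it is only the printing that falls back to escapes.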

DavoWW · October 27, 2022, 11:29am

On my Windows machine I see the Russian text in my Console pane as well.
Maybe the locale R is using on your machine needs changing.
Check these commands for background information:

help(Sys.getlocale)
Sys.getlocale()
l10n_info()

Elle · October 27, 2022, 12:47pm

With the RStudio option set to either [Ask] or [UTF-8], running your locale commands gives the same output for both selections, as below.

> help(Sys.getlocale)
> Sys.getlocale()
[1] "LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252"
> l10n_info()
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
$system.codepage
[1] 1252

DavoWW · October 28, 2022, 4:14am

OK, so your locale is specifying a non-UTF-8 encoding.
If this is going to be an ongoing issue for you, it's time to upgrade to Windows 10 or 11 (if not using already) and also update R to the latest version (R-4.2.1 at the time of writing).
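
Some background on why the upgrade should help: as far as I know, R 4.2.0 and later on Windows use UTF-8 as the native encoding when running on a recent enough Windows 10 build (or Windows 11). A rough sketch for checking where a given session stands, assuming nothing beyond base R (`r_version_ok` and `session_is_utf8` are just illustrative names):

```r
# Is this R new enough to use UTF-8 natively on recent Windows?
r_version_ok <- getRversion() >= "4.2.0"

# Does the current session actually run in a UTF-8 locale?
session_is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])

cat("R version:", as.character(getRversion()), "\n")
cat("R >= 4.2.0:", r_version_ok, "\n")
cat("UTF-8 session:", session_is_utf8, "\n")

if (!session_is_utf8) {
  message("Non-UTF-8 locale: the console may show <U+XXXX> escapes for Cyrillic text.")
}
```

With the locale output posted above (code page 1252), the last check would report a non-UTF-8 session.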
