Text Mining - Language learning

Elle · October 26, 2022, 7:03am

Morning everyone! I listen to podcasts in different languages and read the accompanying transcripts. Using R, I would like to take a transcript, count how often each word occurs, and list the words alphabetically with their counts. For example:

INPUT
'My name is Max! Welcome to my podcast!! This is the first podcast!' (in reality this might be a four-page PDF or Word document)

OUTPUT
first 1
is 2
max 1
my 2
name 1
podcast 2
the 1
this 1
to 1
welcome 1

How can I do this in R, please? Many thanks in advance for your help.

Flm · October 26, 2022, 7:41am

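library(tidyverse)

# One-row tibble holding the transcript text
a <- tibble(
  text = "My name is Max! Welcome to my podcast!! This is the first podcast!"
)

sapply(a, function(x) strsplit(x, split = " ")) %>%  # split on single spaces
  unlist() %>%                                       # flatten to a character vector
  tolower() %>% 
  as_tibble() %>%                                    # one word per row, in column `value`
  mutate(value = str_replace_all(value, "[^[:alnum:]]", "")) %>%  # strip punctuation
  count(value)                                       # count each word; rows come back ordered by word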

Elle · October 26, 2022, 10:08am

Hey @Flm

Thank you for the code. It works great for English words. Do you know how I can use it for other languages? At the moment I want to do this for Russian, French and German. Any ideas?

I got the output below when I put Russian words through the code...

> a <- tibble(
+   text = "Привет, друзья! Меня зовут Макс и добро пожаловать на мой подкаст! 
+   Да, наконец-то, наконец-то я запустил, я сделал свой подкаст! 
+   Ухуууу! И я очень, очень, очень рад этому!"
+   )
> sapply(a, function(x) strsplit(x, split = " ")) %>%  
+   unlist() %>% 
+   tolower() %>% 
+   as_tibble() %>% 
+   mutate(value = str_replace_all(value, "[^[:alnum:]]", "")) %>% 
+   count(value)
# A tibble: 22 x 2
   value          n
   <chr>      <int>
 1 ""             4
 2 "<U+0434><U+0430>"           1
 3 "<U+0434><U+043E><U+0431><U+0440><U+043E>"        1
 4 "<U+0434><U+0440><U+0443><U+0437><U+044C><U+044F>"       1
 5 "<U+0437><U+0430><U+043F><U+0443><U+0441><U+0442><U+0438><U+043B>"     1
 6 "<U+0437><U+043E><U+0432><U+0443><U+0442>"        1
 7 "<U+0438>"            2
 8 "<U+043C><U+0430><U+043A><U+0441>"         1
 9 "<U+043C><U+0435><U+043D><U+044F>"         1
10 "<U+043C><U+043E><U+0439>"          1
# ... with 12 more rows

Flm · October 26, 2022, 10:40am

Hmm, that's quite strange, because when I run the same code this is the result:

library(tidyverse)
a <- tibble(
  text = "Привет, друзья! Меня зовут Макс и добро пожаловать на мой подкаст! 
   Да, наконец-то, наконец-то я запустил, я сделал свой подкаст! 
   Ухуууу! И я очень, очень, очень рад этому!"
)

a

sapply(a, function(x) strsplit(x, split = " ")) %>%  
  unlist() %>% 
  tolower() %>% 
  as_tibble() %>% 
  mutate(value = str_replace_all(value, "[^[:alnum:]]", "")) %>% 
  count(value)
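
A possible variant for non-English transcripts, offered as a sketch rather than something tested on every language: it extracts words with the Unicode letter class `\p{L}` instead of splitting on spaces, so line breaks leave no empty strings and hyphenated words such as "наконец-то" stay in one piece. The function name `count_words` and the rule of keeping hyphens and straight apostrophes inside words are assumptions of this sketch, not something from the posts above.

```r
library(stringr)

# Hypothetical helper: the name and the tokenization rules are assumptions.
count_words <- function(text) {
  # str_to_lower() lowercases via ICU, so it does not rely on the session locale
  lowered <- str_to_lower(paste(text, collapse = " "))
  # A word = a run of Unicode letters/digits, optionally joined by "-" or "'"
  words <- str_extract_all(
    lowered,
    "[\\p{L}\\p{N}]+(?:[\\-'][\\p{L}\\p{N}]+)*"
  )[[1]]
  # table() counts each distinct word; rows are ordered by word
  # (using the collation rules of the current locale)
  as.data.frame(table(word = words), responseName = "n", stringsAsFactors = FALSE)
}

count_words("My name is Max! Welcome to my podcast!! This is the first podcast!")
```

For a PDF transcript, the text would first have to be read into a character vector, for example with `pdftools::pdf_text()`, and then passed to the function; that step is not shown here.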

Elle · October 27, 2022, 9:07am

Hi @Flm

I noticed today that although I get odd output in the console, I get the right output in the viewer! Thanks for this code, it is super helpful for me.
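
A likely explanation, though not confirmed in this thread: the `<U+0434>` sequences are how a console running in a non-UTF-8 locale prints characters it cannot display (commonly seen on Windows with R versions before 4.2), while the data itself is intact. That would match the viewer showing the right words. A minimal check, assuming nothing beyond base R, with `is_utf8_session` as a hypothetical helper name:

```r
# Hypothetical helper: TRUE when the session's native encoding is UTF-8
is_utf8_session <- function() isTRUE(l10n_info()[["UTF-8"]])

# "привет" written with \u escapes, so the script's file encoding does not matter
x <- "\u043f\u0440\u0438\u0432\u0435\u0442"

Encoding(x)        # "UTF-8": the string is stored correctly
nchar(x)           # 6: six characters, not escape sequences
is_utf8_session()  # FALSE would suggest the odd output is only a display issue
```

If the last call returns FALSE, the counts should still be correct; only the console printing is affected.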
