Text mining with a specific dictionary

SBAS · November 23, 2019, 4:12pm

I'm very new to text mining and would like to ask for help with something I want to do.
I have an Excel document with 2 columns: id_text and text. Each row in this dataset represents a specific text. For every row, I would like to check for the presence of specific keywords: I have a dictionary of 17 words to search for in my dataset. When a word from my dictionary appears in the text of a given Id_Text, I would like to print 1, otherwise 0.
I am having trouble finding a package or writing code that does this. Can someone help me?

andresrcs · November 23, 2019, 4:28pm

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example (reprex) for beginners

A minimal reproducible example consists of the following items:

A minimal dataset, necessary to reproduce the issue

The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages.

Let's quickly go over each one of these with examples:

Minimal Dataset (Sample Data)

You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.…
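For a two-column spreadsheet like the one described in the question, one common way to turn a few rows into pasteable sample data is base R's dput(). The following is only a sketch: `sample_data` and its contents are hypothetical stand-ins, not the poster's real data.

```r
# Hypothetical two-row stand-in for the real spreadsheet (names and texts are made up)
sample_data <- data.frame(
  id_text = c("1", "2"),
  text = c("first example text", "second example text"),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object, ready to paste into a post
dput(sample_data)
```

Anyone reading the post can then assign the printed structure(...) call to a variable and work with the same data frame.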

SBAS · November 23, 2019, 5:11pm

OK! I'll try to do that.

Id_text = c("1", "2", "3", "4")

Text = c("Obiettivo del progetto è migliorare i servizi di base dei Paesi in via di sviluppo. I destinatari dell'iniziativa sono la popolazione povera e vulnerabile", "L'iniziativa mira a favorire l'inclusione finanziaria dei soggetti che versano in estrema povertà","Le assimmetrie nella distribuzione della ricchezza sono notevoli in Uganimi, le classi sociali povere hanno difficoltà basilari", "la situazione sociale non è più sostenibile, la gente ha bisogno di protezione sociale e interventi medici urgenti")

data <- data.frame(Id_text, Text)

dictionary <- c("Ambiente", "Uguaglianza", "Povertà estrema", "inclusione finanziaria", "Reddito", "uguaglianza dei redditi", "Microfinanza","Non discriminazione", "Poveri e vulnerabili", "Povertà", "eliminazione della povertà", "Soglia di povertà", "Qualità della vita", "risorse", "protezione sociale", "sostenibile", "distribuzione della ricchezza")
dictionary

andresrcs · November 23, 2019, 5:37pm

Thanks. It is still not entirely clear to me, but is this close to what you are trying to accomplish?

library(tidyverse)

data <- data.frame(stringsAsFactors = FALSE,
     Id_text = c("1", "2", "3", "4"),
        Text = c("Obiettivo del progetto è migliorare i servizi di base dei Paesi in via di sviluppo. I destinatari dell'iniziativa sono la popolazione povera e vulnerabile",
                           "L'iniziativa mira a favorire l'inclusione finanziaria dei soggetti che versano in estrema povertà",
                           "Le assimmetrie nella distribuzione della ricchezza sono notevoli in Uganimi,
                           le classi sociali povere hanno difficoltà basilari",
                           "la situazione sociale non è più sostenibile,
                           la gente ha bisogno di protezione sociale e interventi medici urgenti")
)


dictionary <- c("Ambiente", "Uguaglianza", "Povertà estrema", "inclusione finanziaria",
                "Reddito", "uguaglianza dei redditi", "Microfinanza","Non discriminazione",
                "Poveri e vulnerabili", "Povertà", "eliminazione della povertà",
                "Soglia di povertà", "Qualità della vita", "risorse", "protezione sociale",
                "sostenibile", "distribuzione della ricchezza")

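# For each keyword, test every text for a match and bind the 0/1 flags as new columns
data %>%
    bind_cols(dictionary %>% 
                  set_names() %>%                          # name each keyword after itself, so columns get keyword names
                  map_dfc(~str_detect(data$Text, .x)) %>%  # one logical column per keyword (case-sensitive match)
                  mutate_all(as.numeric)) %>%              # TRUE/FALSE -> 1/0
    as_tibble() # This is just for friendly console printing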
#> # A tibble: 4 x 19
#>   Id_text Text  Ambiente Uguaglianza `Povertà estrem… `inclusione fin… Reddito
#>   <chr>   <chr>    <dbl>       <dbl>            <dbl>            <dbl>   <dbl>
#> 1 1       Obie…        0           0                0                0       0
#> 2 2       L'in…        0           0                0                1       0
#> 3 3       "Le …        0           0                0                0       0
#> 4 4       "la …        0           0                0                0       0
#> # … with 12 more variables: `uguaglianza dei redditi` <dbl>,
#> #   Microfinanza <dbl>, `Non discriminazione` <dbl>, `Poveri e
#> #   vulnerabili` <dbl>, Povertà <dbl>, `eliminazione della povertà` <dbl>,
#> #   `Soglia di povertà` <dbl>, `Qualità della vita` <dbl>, risorse <dbl>,
#> #   `protezione sociale` <dbl>, sostenibile <dbl>, `distribuzione della
#> #   ricchezza` <dbl>

Created on 2019-11-23 by the reprex package (v0.3.0.9000)
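One thing to watch in the approach above: str_detect() is case-sensitive by default, so a dictionary entry such as "Povertà" does not match "povertà" in the second text. If case should be ignored (an assumption, since the question does not say), a minimal base R sketch is to lowercase both the keywords and the texts before matching. The names `hits` and `result` are illustrative only, and the sample texts are written on one line each.

```r
# Sample data from the thread, with each text on a single line
data <- data.frame(
  Id_text = c("1", "2", "3", "4"),
  Text = c(
    "Obiettivo del progetto è migliorare i servizi di base dei Paesi in via di sviluppo. I destinatari dell'iniziativa sono la popolazione povera e vulnerabile",
    "L'iniziativa mira a favorire l'inclusione finanziaria dei soggetti che versano in estrema povertà",
    "Le assimmetrie nella distribuzione della ricchezza sono notevoli in Uganimi, le classi sociali povere hanno difficoltà basilari",
    "la situazione sociale non è più sostenibile, la gente ha bisogno di protezione sociale e interventi medici urgenti"
  ),
  stringsAsFactors = FALSE
)

dictionary <- c("Ambiente", "Uguaglianza", "Povertà estrema", "inclusione finanziaria",
                "Reddito", "uguaglianza dei redditi", "Microfinanza", "Non discriminazione",
                "Poveri e vulnerabili", "Povertà", "eliminazione della povertà",
                "Soglia di povertà", "Qualità della vita", "risorse", "protezione sociale",
                "sostenibile", "distribuzione della ricchezza")

# Assumption: matching should ignore case, so both sides are lowercased first.
# fixed = TRUE treats each keyword as a literal string, not a regular expression.
hits <- sapply(dictionary, function(keyword) {
  as.integer(grepl(tolower(keyword), tolower(data$Text), fixed = TRUE))
})

# One row per text, one 0/1 column per keyword (column names are the keywords)
result <- cbind(data["Id_text"], hits)
result
```

This is still plain substring matching. Inflected or reordered forms, such as "povera e vulnerabile" for "Poveri e vulnerabili" or "estrema povertà" for "Povertà estrema", are not detected; catching those would need stemming or a different matching strategy.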
