Hello!
I have 10,114 text messages (and counting) in 4 different languages that I would like to code for analysis.
I'm looking for a concise way to label/recode the text messages based on keywords/text within the message.
The data looks similar to this:
library(tidyverse)
library(knitr)
msg <- tibble::tribble(
  ~ID, ~text,
  1, "Please call me from Jane, sent on: Mar 1, 2019",
  2, "Please call me from Dan, sent on: Feb 5, 2018",
  3, "Please call me from Ben, sent on: Mar 9, 2017",
  4, "Reminder to do something Jane, sent on: Apr 1, 2016",
  5, "Reminder to do this Dan, sent on: Jun 14, 2019",
  6, "Reminder to do something else Ben, sent on: Jan 1, 2018"
)
msg %>% kable()
ID | text |
---|---|
1 | Please call me from Jane, sent on: Mar 1, 2019 |
2 | Please call me from Dan, sent on: Feb 5, 2018 |
3 | Please call me from Ben, sent on: Mar 9, 2017 |
4 | Reminder to do something Jane, sent on: Apr 1, 2016 |
5 | Reminder to do this Dan, sent on: Jun 14, 2019 |
6 | Reminder to do something else Ben, sent on: Jan 1, 2018 |
I would like to add a label variable that classifies each message based on its contents, for use in further analysis. For example:
library(tidyverse)
library(knitr)
msg_lab <- tibble::tribble(
  ~ID, ~text, ~label,
  1, "Please call me from Jane, sent on: Mar 1, 2019", "Call me",
  2, "Please call me from Dan, sent on: Feb 5, 2018", "Call me",
  3, "Please call me from Ben, sent on: Mar 9, 2017", "Call me",
  4, "Reminder to do something Jane, sent on: Apr 1, 2016", "Reminder",
  5, "Reminder to do this Dan, sent on: Jun 14, 2019", "Reminder",
  6, "Reminder to do something else Ben, sent on: Jan 1, 2018", "Reminder"
)
msg_lab %>% kable()
ID | text | label |
---|---|---|
1 | Please call me from Jane, sent on: Mar 1, 2019 | Call me |
2 | Please call me from Dan, sent on: Feb 5, 2018 | Call me |
3 | Please call me from Ben, sent on: Mar 9, 2017 | Call me |
4 | Reminder to do something Jane, sent on: Apr 1, 2016 | Reminder |
5 | Reminder to do this Dan, sent on: Jun 14, 2019 | Reminder |
6 | Reminder to do something else Ben, sent on: Jan 1, 2018 | Reminder |
table(msg_lab$label)
#>
#>  Call me Reminder
#>        3        3
I'm trying to use the fct_recode function from the forcats package. This solution works; however, it isn't feasible for my data set, because every message has to be listed verbatim.
library(forcats)
msg_lab <- msg %>%
  mutate(label = fct_recode(text,
    "Call me" = "Please call me from Jane, sent on: Mar 1, 2019",
    "Call me" = "Please call me from Dan, sent on: Feb 5, 2018",
    "Call me" = "Please call me from Ben, sent on: Mar 9, 2017",
    "Reminder" = "Reminder to do something Jane, sent on: Apr 1, 2016",
    "Reminder" = "Reminder to do this Dan, sent on: Jun 14, 2019",
    "Reminder" = "Reminder to do something else Ben, sent on: Jan 1, 2018"
  ))
table(msg_lab$label)
#>
#>  Call me Reminder
#>        3        3
I tried to use the str_detect function from the stringr package to detect the keywords used in my labels. Unfortunately, passing its result to fct_recode produces an error.
library(stringr)
str_detect("Please call me from Jane, sent on: Mar 1, 2019", "call me")
#> [1] TRUE
msg_lab <- msg %>%
  mutate(label = fct_recode(text,
    "Call me" = str_detect(text, "call me"),
    "Reminder" = str_detect(text, "Reminder")
  ))
#> Error: Each input to fct_recode must be a single named string. Problems at positions: 1, 2
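As far as I can tell, the error arises because fct_recode expects each new label to map to a single existing level given as a string, whereas str_detect returns a logical vector with one TRUE/FALSE per row. What I think I need is something that evaluates a pattern row by row and assigns a label when it matches. A minimal sketch of that idea, assuming case_when() from dplyr is a reasonable fit (I have not checked it against the full data set; the third row, the case-insensitive matching, and the "Other" fallback label are placeholders I made up for illustration):

```r
library(dplyr)
library(stringr)

# Small stand-in for the real data; row 7 is a hypothetical message
# that matches none of the keywords.
msg <- tibble::tribble(
  ~ID, ~text,
  1, "Please call me from Jane, sent on: Mar 1, 2019",
  4, "Reminder to do something Jane, sent on: Apr 1, 2016",
  7, "Something unrelated, sent on: Jan 1, 2020"
)

msg_lab <- msg %>%
  mutate(label = case_when(
    # Conditions are checked in order; the first match wins.
    str_detect(text, regex("call me", ignore_case = TRUE)) ~ "Call me",
    str_detect(text, regex("reminder", ignore_case = TRUE)) ~ "Reminder",
    # Fallback for messages that match no keyword (placeholder label).
    TRUE ~ "Other"
  ))

table(msg_lab$label)
```

If this is the right direction, keywords from the other languages could presumably be added to each pattern with regex alternation (`|`), but I am not sure that scales well to many labels.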
I would appreciate some pointers on how to do this!