I have some free text data that I'm trying to recategorize. The data arises from health care coordinator contacts with patients and caregivers, which can be phone calls, emails, or text messages. I'm trying to use str_detect() with two wildcards and am getting a syntax error. Here's a reprex containing dummy data (not actual patient data).


library(tidyverse)

commdf <- tribble(
  ~case, ~purpose,
  1,     "set up visit",
  2,     "left message with client",
  3,     "Texted about visit",
  4,     "left voicemail",
  5,     "communication about appointment",
  6,     "phone call",
  7,     "Emailed client",
  8,     "client called back",
  10,    "texted client"
)

commdf <- commdf %>% 
  mutate(commtype = case_when(
    str_detect(str_to_lower(purpose), "*call*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*spoke*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*message*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*phone*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*discuss*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*reported*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*set up*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*confirm*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*sched*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*communicat*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*voicemail*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*vm*") == TRUE ~ "call",
    str_detect(str_to_lower(purpose), "*text*") == TRUE ~ "text",
    str_detect(str_to_lower(purpose), "*txt*") == TRUE ~ "text",
    str_detect(str_to_lower(purpose), "*email*") == TRUE ~ "email",
    str_detect(str_to_lower(purpose), "*e-mail*") == TRUE ~ "email",
    TRUE ~ NA
  ))
#> Error in mutate_impl(.data, dots): Evaluation error: Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX).

I'm not sure what's causing the error, but I wonder if it's from using two wildcard asterisks. Is this use legitimate? Is there a better way to go about this?

Also, I'd like to condense this code further by using something like c("*call*", "*spoke*", "*message*", ...) within str_detect(), but first need to figure out the regex error I'm getting.

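If you just want to match the text, e.g. "call", you don't need the *. str_detect() looks for the pattern anywhere in the string, so no wildcards are needed to find it inside a longer word. In a regular expression, * is a quantifier meaning "zero or more of the preceding item"; a leading * has nothing to repeat, which is what triggers the syntax error.

If you are also looking for literal asterisk characters, then you need to escape them. Inside an R string the backslash itself has to be doubled, i.e. "\\*call\\*".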
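
To make the difference concrete, here is a minimal sketch with made-up strings (the vector `x` and the variable names are assumptions for illustration, not part of the original data). It shows the plain pattern, the failing pattern, and two ways to match literal asterisks:

```r
library(stringr)

# Made-up example strings (not from the original data)
x <- c("phonecalls", "client called back", "texted client")

# A plain pattern is matched anywhere in the string; no wildcards needed
plain <- str_detect(x, "call")

# A leading * has nothing to repeat, so the regex engine raises an error;
# tryCatch() turns that error into a marker value for demonstration
bad <- tryCatch(str_detect(x, "*call*"), error = function(e) "regex error")

# Literal asterisks: escape them (backslash doubled inside an R string) ...
escaped <- str_detect("*call*", "\\*call\\*")

# ... or skip regex interpretation entirely with fixed()
literal <- str_detect("*call*", fixed("*call*"))
```

Here `plain` is TRUE for the first two strings and FALSE for the third, while `bad` holds the marker value because the pattern never compiles.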


Oh wow - looks like I've been using completely unnecessary asterisks! I thought they were needed to capture buried substrings - for example, call within phonecalls, called, etc. Thanks, Martin!

You could shorten the code quite a bit by combining some of the regular expressions. For example:

commdf <- commdf %>% 
  mutate(commtype = case_when(
    str_detect(str_to_lower(purpose), "call|spoke|message|phone|discuss|reported|set up|confirm|sched|communicat|voicemail|vm") ~ "call",
    str_detect(str_to_lower(purpose), "te?xt") ~ "text",
    str_detect(str_to_lower(purpose), "e-?mail") ~ "email",
    TRUE ~ NA_character_
  ))
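
If you would rather keep the keywords in vectors, as in the c(...) idea from the question, one possible sketch is to collapse each vector into a single alternation pattern. The helper `as_pattern()`, the keyword vectors, and the `demo` data are hypothetical names made up for illustration, and the keyword lists are shortened:

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical keyword vectors (shortened; extend as needed)
call_words  <- c("call", "spoke", "message", "phone", "voicemail", "vm")
text_words  <- c("text", "txt")
email_words <- c("email", "e-mail")

# Hypothetical helper: join keywords with | so any one of them can match
as_pattern <- function(words) str_c(words, collapse = "|")

# Made-up demo data
demo <- tibble(
  purpose = c("left voicemail", "Texted about visit", "Emailed client", "other")
)

demo <- demo %>%
  mutate(
    purpose_lc = str_to_lower(purpose),  # lower-case once, reuse below
    commtype = case_when(
      str_detect(purpose_lc, as_pattern(call_words))  ~ "call",
      str_detect(purpose_lc, as_pattern(text_words))  ~ "text",
      str_detect(purpose_lc, as_pattern(email_words)) ~ "email",
      TRUE ~ NA_character_
    )
  )
```

This assumes the keywords contain no regex metacharacters; any that do would need escaping before being collapsed into the pattern.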

Perfect, this is exactly what I needed. Thank you Joel!

Dropping in to plug regexplain again (and its inspiration, RegExr). I’ve found that being able to see what your regex is matching in some of your own sample data, reactively updating as you fiddle, is huge for flattening the learning curve.

And regexplain will add the extra escape characters necessary when using regex in R for you! (You know what’s less fun than debugging missing escapes in a regex? Debugging missing double escapes in a regex.)
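
As a small illustration of the double-escape issue, here is a sketch with made-up strings (the vector `x` and the variable names are assumptions). The raw-string form assumes R 4.0.0 or later:

```r
library(stringr)

# Made-up example strings
x <- c("3.5", "305")

# An unescaped . matches any character, so both strings match
any_char <- str_detect(x, "3.5")

# The regex needs \. for a literal dot; in an R string that is written "\\."
literal_dot <- str_detect(x, "3\\.5")

# Raw strings (assumes R >= 4.0.0) avoid doubling the backslash
raw_dot <- str_detect(x, r"(3\.5)")
```

Both `literal_dot` and `raw_dot` match only the first string, whereas `any_char` matches both.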


This will definitely make life easier - thank you!

