Max or sum across with a condition for str_detect plus general expressions and recoding issues

Hi,
I have this simple df

data.frame <- data.frame(
  stringsAsFactors = FALSE,
  check.names = FALSE,
  `Registration Date` = c("2016-04-29",
                          "2023-05-26","2021-06-09","2021-06-25","2022-06-15",
                          "2020-06-19","2016-02-23","2011-03-31","2022-06-10",
                          "2006-09-01","2018-06-29","2020-03-16","2022-05-30"),
  `Delivery Date` = c("2016-04-29",
                      "2023-05-31","2021-06-01","2021-06-30","2022-06-16",
                      "2020-06-19",NA,NA,"2022-06-11",NA,"2010-09-30",
                      "2020-03-16","2022-06-30"),
  `Call Notes` = c("it does it",
                   "Sent email yesterday, doesn't work.",
                   NA,"Customer got new car on order with Beechwood.",
                   "Booked. 10.05.2023 HP",NA,
                   "This call was automatically generated as a result of the overnight booking data feed.",NA,"Booked. 10.05.2023 HP",
                   "Booked. 31.05.2023 HP","CAN APPROACH CUSTOMER ON DAY OF BOOKING.",
                   "This call was automatically generated as a result of the overnight booking data feed.",
                   "This call was automatically generated as a result of the overnight booking data feed.")
)

data.frame

where I categorise comments using this code:

library(dplyr)
library(stringr)
data.frame <- rename(data.frame, CallNotes = 'Call Notes')
str(data.frame)

result <- data.frame %>% 
  mutate(Cat.Blank = if_else(is.na(CallNotes),1,0),
         Cat.OvernightBooking = if_else(str_detect(CallNotes, regex("overnight\\sbooking", ignore_case = TRUE, multiline = TRUE)), 1,0),
         Cat.Booked = if_else(str_detect(CallNotes, regex("booked|boooked", ignore_case = TRUE, multiline = TRUE)), 1,0),
         Cat.DoesNot.DidNot = if_else(str_detect(CallNotes, regex("does", ignore_case = TRUE, multiline = TRUE)), 1,0)) %>% 
  mutate(All = max(c_across(starts_with("Cat.")), na.rm = T)) %>% 
  mutate(Cat.Other = case_when(
    All ==0 ~ 1,
    All > 0 ~ 0))

str(result)

library(dplyr)
result <- select(result, -All)

result <- result %>% 
  mutate_at(vars(-c(1:3)), ~if_else(is.na(.), 0, .))

str(result)

but I have following issues:

  1. "Does" and "Doesn't" are categorised the same way as I cannot find a way of including expressions with an apostrophe (')
  2. My awkward way of creating category called "Other" is not working and is overcomplicated. Basically, every record not blank and not categorised should be coded as "Cat.Other". I would like to use c_across for variables including "Cat."
  3. In the end, all "Cat." variables should be 0 or 1 and my final code includes vars(-c(1:3)) which should be replaced by c_across for variables including "Cat."
  4. Finally, I believe everything should work as one coding process instead of my weird three stages.

Can anyone help me to do it proplerly?

oh, I cannot see any responses. I hope this task is not really complicated but your advise might be useful for many users like me

Do you need multiple columns for that, or was this just the best working approach for you?
Here case_when() could be used:

result <- data.frame %>% 
  mutate(Cat = case_when(
    str_detect(CallNotes, regex("overnight\\sbooking", 
                                ignore_case = TRUE, multiline = TRUE)) ~ "OvernightBooking",
    str_detect(CallNotes, regex("booked|boooked", 
                                ignore_case = TRUE, multiline = TRUE)) ~ "Booked",
    str_detect(CallNotes, regex("doesn't", ignore_case = TRUE, multiline = TRUE)) ~ "doesn't",
    str_detect(CallNotes, regex("does", ignore_case = TRUE, multiline = TRUE)) ~ "does",
    TRUE ~ "other"))

Please note, here the order of the results matter, the first hit is used, meaning you should do your logik from the most specific to the less specific. For example swapping the lines for "doesn't" and "does" will file everything under "does" as this is included in both. However, defining first the "doesn't" will prevent these from matching to "does"...

Oh, sorry if my post was not clear but str_detect should create multiresponse variables starting or ending with Cat. as some sentences in my real data set might include more than one phrase indicated. Also, BLANK flag is important.

What about this:

result <- data.frame |> 
  mutate(qry = tolower(CallNotes)) |> 
  mutate(
    Cat.Blank = is.na(CallNotes),
    Cat.OvernightBooking = str_detect(qry, "overnight\\sbooking"),
    Cat.Booked = str_detect(qry, "booked|boooked"),
    Cat.DoesNot.DidNot = str_detect(qry, "doesn't")
  ) |> 
  mutate(across(starts_with("Cat."), ~replace_na(.x, FALSE))) |> 
  rowwise() |> 
  mutate(Cat.Other = sum(c_across(starts_with("Cat."))) == 0) |> 
  mutate(across(where(is.logical), as.integer)) |> 
  select(-qry)

thank you but I have following error:

Error in `mutate()`:
ℹ In argument: `across(starts_with("Cat."), ~replace_na(.x, FALSE))`.
Caused by error in `across()`:
! Can't compute column `Cat.Blank`.
Caused by error in `replace_na()`:
! could not find function "replace_na"
Run `rlang::last_error()` to see where the error occurred.

Also, I wanted to use a code taking into account variables including "Cat." in across as my real data frame includes other logical and integer variables. Let's use this example:



data.frame <- data.frame(
  stringsAsFactors = FALSE,
  check.names = FALSE,
  `Registration Date` = c("2016-04-29",
                          "2023-05-26","2021-06-09","2021-06-25","2022-06-15",
                          "2020-06-19","2016-02-23","2011-03-31","2022-06-10",
                          "2006-09-01","2018-06-29","2020-03-16","2022-05-30"),
  `Delivery number` = c(1,6,7,8,4,5,6,3,4,5,6,7,8),
  `Call Notes` = c("it does it",
                   "Sent email yesterday, doesn't work.",
                   NA,"Customer got new car on order with Beechwood.",
                   "Booked. 10.05.2023 HP",NA,
                   "This call was automatically generated as a result of the overnight booking data feed.",NA,"Booked. 10.05.2023 HP",
                   "Booked. 31.05.2023 HP","CAN APPROACH CUSTOMER ON DAY OF BOOKING.",
                   "This call was automatically generated as a result of the overnight booking data feed.",
                   "This call was automatically generated as a result of the overnight booking data feed.")
)

data.frame

I am also using

data.frame <- rename(data.frame, CallNotes = 'Call Notes')

to remove spaces from a variable name as I have issues with using "spaced variables" with dplyr. Perhaps there is a simple code removing spaces in all variable names prior to mutate?

It works for me, if I use:

data.frame <- dplyr::rename(data.frame, CallNotes = 'Call Notes')

replace_na requires tidyr library, just add

library(tidyr)
1 Like

perfect, I meant using variable names with spaces does not always work in the code so I remove spaces before I use any variables in a code. Is there any way of removing spaces in all df variables in dplyr? Rename code above works for CallNotes only but what if we have multiple variables with spaces? Do we need to rename them one by one or is there any clever way of renaming them all in one go?

No, we don't.
Here you are:

names(data.frame) <- names(data.frame) %>% stringr::str_replace_all("\\s","")

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.