Max or sum across with a condition for str_detect plus general expressions and recoding issues

Slavek · July 20, 2023, 4:26pm

Hi,
I have this simple df

data.frame <- data.frame(
  stringsAsFactors = FALSE,
  check.names = FALSE,
  `Registration Date` = c("2016-04-29",
                          "2023-05-26","2021-06-09","2021-06-25","2022-06-15",
                          "2020-06-19","2016-02-23","2011-03-31","2022-06-10",
                          "2006-09-01","2018-06-29","2020-03-16","2022-05-30"),
  `Delivery Date` = c("2016-04-29",
                      "2023-05-31","2021-06-01","2021-06-30","2022-06-16",
                      "2020-06-19",NA,NA,"2022-06-11",NA,"2010-09-30",
                      "2020-03-16","2022-06-30"),
  `Call Notes` = c("it does it",
                   "Sent email yesterday, doesn't work.",
                   NA,"Customer got new car on order with Beechwood.",
                   "Booked. 10.05.2023 HP",NA,
                   "This call was automatically generated as a result of the overnight booking data feed.",NA,"Booked. 10.05.2023 HP",
                   "Booked. 31.05.2023 HP","CAN APPROACH CUSTOMER ON DAY OF BOOKING.",
                   "This call was automatically generated as a result of the overnight booking data feed.",
                   "This call was automatically generated as a result of the overnight booking data feed.")
)

data.frame

where I categorise comments using this code:

library(dplyr)
library(stringr)
data.frame <- rename(data.frame, CallNotes = 'Call Notes')
str(data.frame)

result <- data.frame %>% 
  mutate(Cat.Blank = if_else(is.na(CallNotes),1,0),
         Cat.OvernightBooking = if_else(str_detect(CallNotes, regex("overnight\\sbooking", ignore_case = TRUE, multiline = TRUE)), 1,0),
         Cat.Booked = if_else(str_detect(CallNotes, regex("booked|boooked", ignore_case = TRUE, multiline = TRUE)), 1,0),
         Cat.DoesNot.DidNot = if_else(str_detect(CallNotes, regex("does", ignore_case = TRUE, multiline = TRUE)), 1,0)) %>% 
  mutate(All = max(c_across(starts_with("Cat.")), na.rm = T)) %>% 
  mutate(Cat.Other = case_when(
    All ==0 ~ 1,
    All > 0 ~ 0))

str(result)

library(dplyr)
result <- select(result, -All)

result <- result %>% 
  mutate_at(vars(-c(1:3)), ~if_else(is.na(.), 0, .))

str(result)

but I have following issues:

"Does" and "Doesn't" are categorised the same way as I cannot find a way of including expressions with an apostrophe (')
My awkward way of creating category called "Other" is not working and is overcomplicated. Basically, every record not blank and not categorised should be coded as "Cat.Other". I would like to use c_across for variables including "Cat."
In the end, all "Cat." variables should be 0 or 1 and my final code includes vars(-c(1:3)) which should be replaced by c_across for variables including "Cat."
Finally, I believe everything should work as one coding process instead of my weird three stages.

Can anyone help me to do it proplerly?

Slavek · July 26, 2023, 8:42am

oh, I cannot see any responses. I hope this task is not really complicated but your advise might be useful for many users like me

Matthias · July 26, 2023, 9:42am

Do you need multiple columns for that, or was this just the best working approach for you?
Here case_when() could be used:

result <- data.frame %>% 
  mutate(Cat = case_when(
    str_detect(CallNotes, regex("overnight\\sbooking", 
                                ignore_case = TRUE, multiline = TRUE)) ~ "OvernightBooking",
    str_detect(CallNotes, regex("booked|boooked", 
                                ignore_case = TRUE, multiline = TRUE)) ~ "Booked",
    str_detect(CallNotes, regex("doesn't", ignore_case = TRUE, multiline = TRUE)) ~ "doesn't",
    str_detect(CallNotes, regex("does", ignore_case = TRUE, multiline = TRUE)) ~ "does",
    TRUE ~ "other"))

Please note, here the order of the results matter, the first hit is used, meaning you should do your logik from the most specific to the less specific. For example swapping the lines for "doesn't" and "does" will file everything under "does" as this is included in both. However, defining first the "doesn't" will prevent these from matching to "does"...

Slavek · July 26, 2023, 10:32am

Oh, sorry if my post was not clear but str_detect should create multiresponse variables starting or ending with Cat. as some sentences in my real data set might include more than one phrase indicated. Also, BLANK flag is important.

MarekGierlinski · July 26, 2023, 10:58am

What about this:

result <- data.frame |> 
  mutate(qry = tolower(CallNotes)) |> 
  mutate(
    Cat.Blank = is.na(CallNotes),
    Cat.OvernightBooking = str_detect(qry, "overnight\\sbooking"),
    Cat.Booked = str_detect(qry, "booked|boooked"),
    Cat.DoesNot.DidNot = str_detect(qry, "doesn't")
  ) |> 
  mutate(across(starts_with("Cat."), ~replace_na(.x, FALSE))) |> 
  rowwise() |> 
  mutate(Cat.Other = sum(c_across(starts_with("Cat."))) == 0) |> 
  mutate(across(where(is.logical), as.integer)) |> 
  select(-qry)

Slavek · July 26, 2023, 11:18am

thank you but I have following error:

Error in `mutate()`:
ℹ In argument: `across(starts_with("Cat."), ~replace_na(.x, FALSE))`.
Caused by error in `across()`:
! Can't compute column `Cat.Blank`.
Caused by error in `replace_na()`:
! could not find function "replace_na"
Run `rlang::last_error()` to see where the error occurred.

Also, I wanted to use a code taking into account variables including "Cat." in across as my real data frame includes other logical and integer variables. Let's use this example:



data.frame <- data.frame(
  stringsAsFactors = FALSE,
  check.names = FALSE,
  `Registration Date` = c("2016-04-29",
                          "2023-05-26","2021-06-09","2021-06-25","2022-06-15",
                          "2020-06-19","2016-02-23","2011-03-31","2022-06-10",
                          "2006-09-01","2018-06-29","2020-03-16","2022-05-30"),
  `Delivery number` = c(1,6,7,8,4,5,6,3,4,5,6,7,8),
  `Call Notes` = c("it does it",
                   "Sent email yesterday, doesn't work.",
                   NA,"Customer got new car on order with Beechwood.",
                   "Booked. 10.05.2023 HP",NA,
                   "This call was automatically generated as a result of the overnight booking data feed.",NA,"Booked. 10.05.2023 HP",
                   "Booked. 31.05.2023 HP","CAN APPROACH CUSTOMER ON DAY OF BOOKING.",
                   "This call was automatically generated as a result of the overnight booking data feed.",
                   "This call was automatically generated as a result of the overnight booking data feed.")
)

data.frame

I am also using

data.frame <- rename(data.frame, CallNotes = 'Call Notes')

to remove spaces from a variable name as I have issues with using "spaced variables" with dplyr. Perhaps there is a simple code removing spaces in all variable names prior to mutate?

Andrzej · July 26, 2023, 11:24am

It works for me, if I use:

data.frame <- dplyr::rename(data.frame, CallNotes = 'Call Notes')

MarekGierlinski · July 26, 2023, 11:35am

replace_na requires tidyr library, just add

library(tidyr)

Slavek · July 26, 2023, 11:45am

perfect, I meant using variable names with spaces does not always work in the code so I remove spaces before I use any variables in a code. Is there any way of removing spaces in all df variables in dplyr? Rename code above works for CallNotes only but what if we have multiple variables with spaces? Do we need to rename them one by one or is there any clever way of renaming them all in one go?

Andrzej · July 26, 2023, 3:20pm

No, we don't.
Here you are:

names(data.frame) <- names(data.frame) %>% stringr::str_replace_all("\\s","")

system · August 2, 2023, 3:21pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.