dplyr case_when must be a two-sided formula not a character vector

afp · April 17, 2023, 6:53am

I'm having trouble running an R script via a Bitbucket pipeline. The script runs on my local machine, but does not work when run in the Bitbucket environment. The relevant part of the script is:

opportunities <- glue("
                      SELECT Primary_Campaign_Name
                      FROM Opportunity
                      ") %>%
  sf_query(object_name = "Opportunity", 
           api_type = "Bulk 2.0",
           guess_types = FALSE) %>%
  mutate(Year = case_when(str_detect(Primary_Campaign_Name, "RCAU19") ~ 2019,
                          str_detect(Primary_Campaign_Name, "RCAU20") ~ 2020,
                          str_detect(Primary_Campaign_Name, "RCAU21") ~ 2021,
                          TRUE ~ 2022))

The error that I'm getting is:

Error in `mutate()`:
! Problem while computing `Year = case_when(...)`.
Caused by error in `case_when()`:
! Case 11 (`str_detect(Primary_Campaign_Name, "RCAU19") ~ 2019`) must be
  a two-sided formula, not a double vector.

Not sure how to troubleshoot this, because it's working fine on my computer.

I have dplyr version 1.1.1 which I believe is the latest, and my Dockerfile command to install dplyr is the following which I assume would install the same (latest) version of the package.

RUN install2.r --error \
  DBI \
  dplyr \
  glue \
  lubridate \
  readxl \
  remotes \
  stringr

Any advice or pointers on how to troubleshoot this would be appreciated.

technocrat · April 17, 2023, 8:19am

Short answer: stringr::str_detect() takes an input vector and a pattern then returns a logical vector consisting of some combination of TRUE and FALSE and it's not getting it. Instead it's getting an object derived from taking the return of glue() piped to sf_query() as a SOQL query against the Salesforce object "Opportunity": which returns a tdb_df.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)
d <- data.frame(primary = paste0("RCAU",2019:2022))
d %>% select(primary)%>% 
  mutate(when_to = 
  case_when(
    str_detect(.,"RCAU2019") ~ 2019,
    str_detect(.,"RCAU2020") ~ 2020,
    str_detect(.,"RCAU2021") ~ 2021,
    str_detect(.,"RCAU2022") ~ 2022
  )
)
#> Warning: There were 4 warnings in `mutate()`.
#> The first warning was:
#> ℹ In argument: `when_to = case_when(...)`.
#> Caused by warning in `stri_detect_regex()`:
#> ! argument is not an atomic vector; coercing
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.
#>    primary when_to
#> 1 RCAU2019    2019
#> 2 RCAU2020    2019
#> 3 RCAU2021    2019
#> 4 RCAU2022    2019

# or reset d and use base
d <- data.frame(primary = paste0("RCAU",2019:2022))
pat <- "RCAU"
d$when_to <- as.numeric(sub(pat,"",d[1][[1]]))
d
#>    primary when_to
#> 1 RCAU2019    2019
#> 2 RCAU2020    2020
#> 3 RCAU2021    2021
#> 4 RCAU2022    2022

# explication

d[1] |> str() # object is a data frame
#> 'data.frame':    4 obs. of  1 variable:
#>  $ primary: chr  "RCAU2019" "RCAU2020" "RCAU2021" "RCAU2022"
d[1][[1]] |> str() # object is a vector
#>  chr [1:4] "RCAU2019" "RCAU2020" "RCAU2021" "RCAU2022"

# when offered data with beautifully simple
# structure, exploit it, here the source
# variable is an 8-character string beginning
# with "RCAU" and what is sought is a numeric
# version of what is left after removing it

# pattern to be removed

pat
#> [1] "RCAU"

# use sub() to find pat and replace it with
# nothing (= "") in the vector d[1][[1]]
# (the first column of the data frame in 
# unlisted form)

as.numeric(sub(pat,"",d[1][[1]]))
#> [1] 2019 2020 2021 2022

# create a new variable with this

d$when_to <- as.numeric(sub(pat,"",d[1][[1]]))

# show the result

d
#>    primary when_to
#> 1 RCAU2019    2019
#> 2 RCAU2020    2020
#> 3 RCAU2021    2021
#> 4 RCAU2022    2022

^{Created on 2023-04-17 with reprex v2.0.2}

The difficulty in using the tidyverse approach is that piping with %>% makes it easy to lose track of what arguments are in scope next to be evaluated, especially when relying on implicit passing as when mutate() is piped a data frame object with no indication of what part of it to use, as in the question. Borrowing from Zen of Python

Explicit is better than implicit.

The virtue of the base approach is that everything is explicit. Beginning with the paradigm of school algebra—f(x) = y we begin with

d in the role of x, what is to hand
d$when_to in the role of y what is to be added to d
with f a composite function

d$when_to uses the $ operator to designate a variable name in d
<- assigns a value to d from what follows
as.numeric() converts its argument to a number
sub() searches for the defined pat string, and replaces it with an empty string "" from d[1][[1]]
d[1] selects the first column of d which is where what we are really interested in resides in d
d[1][[1]] converts d[1] from a list object to a vector object

Every step of the process is something that can be looked at and understood without having to keep in mind something further up the chain.

This would be easier, of course, without the pernicious habit of using data frames unnecessarily.

(m <- matrix(paste0("RCAU",2019:2022),ncol = 1))
#>      [,1]      
#> [1,] "RCAU2019"
#> [2,] "RCAU2020"
#> [3,] "RCAU2021"
#> [4,] "RCAU2022"
(y = sub("RCAU","",m[,1]))
#> [1] "2019" "2020" "2021" "2022"
(d <- data.frame(primary = m, when_to = y))
#>    primary when_to
#> 1 RCAU2019    2019
#> 2 RCAU2020    2020
#> 3 RCAU2021    2021
#> 4 RCAU2022    2022

A data frame exists to accommodate a mix of variables that are character and numeric. When variables can be isolated to be all one or the other, a matrix is a much more convenient object type because it is just a regular collection of values arranged in rows and columns, addressed with m[row_number,col_number] which means that you can apply a function to everything f(m), to the specific rows and columns f(m[1:2,c(8,9)], etc. In conjunction with which() and friends selecting and filtering based on logical tests is simple.

afp · April 18, 2023, 12:26am

Wow thank you so much for that lengthy and detailed explanation, super appreciated!

afp · April 18, 2023, 12:42am

Doing some digging around this morning, I realised that if I use TRUE instead of .default in the last line of case_when then it works as I had intended. (I know that in my example I have used TRUE but I stripped my actual use-case to use as an example, and in actuality I had used .default.)

Would you know what it is that causes that change in behaviour? Is that an intentional change??

But regardless, thanks so much for the awesome response above!

technocrat · April 18, 2023, 7:30am

Afraid I can't help with the follow-up question. My syntax retention is challenged and when it gets down into the weeds with tidyverse, I turn to [base}.

system · April 25, 2023, 7:31am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.