How to use mutate and across with a dynamic column condition in the .fns argument?

joels · March 5, 2021, 6:54pm

I'm trying to mutate several columns whose column names have the same prefix and a number as suffix. Each column is mutated based on a value in another column with the corresponding suffix in its name. I can run mutate using each pair of columns explicitly. However, I'd prefer, if possible, to use a single across operation, but can't figure out how to make it work.

Below is an example: In the data below, when there's a missing value in a units column, I want to put a "U" in the corresponding status column. So, in this case, units1 has a missing value in row 2, so, in status1 we want row 2 to change from "P" to "U", and likewise for row 4 of the units3, status3 pair.

library(tidyverse)

# Fake data
set.seed(2)
d = bind_cols(
  paste0("status", 1:3) %>% 
    set_names() %>% 
    map_df(~sample(c("P","F"), 5, replace=TRUE)),
  paste0("units", 1:3) %>% 
    set_names() %>% 
    map_df(~sample(c(0:2,NA), 5, prob=c(4,4,4,1), replace=TRUE))
)

d
#> # A tibble: 5 x 6
#>   status1 status2 status3 units1 units2 units3
#>   <chr>   <chr>   <chr>    <int>  <int>  <int>
#> 1 P       F       P            1      1      2
#> 2 P       P       P           NA      2      0
#> 3 F       P       F            0      1      2
#> 4 F       P       P            2      0     NA
#> 5 F       F       P            0      2      0

This can be done by mutating each column separately, which I'd like to avoid:

# Option 1: The hard way
d %>% 
  mutate(status1 = case_when(is.na(units1) ~ "U",
                             TRUE ~ status1),
         status2 = case_when(is.na(units2) ~ "U",
                             TRUE ~ status2),
         status3 = case_when(is.na(units3) ~ "U",
                             TRUE ~ status3))
#> # A tibble: 5 x 6
#>   status1 status2 status3 units1 units2 units3
#>   <chr>   <chr>   <chr>    <int>  <int>  <int>
#> 1 P       F       P            1      1      2
#> 2 U       P       P           NA      2      0
#> 3 F       P       F            0      1      2
#> 4 F       P       U            2      0     NA
#> 5 F       F       P            0      2      0

It can also be done by pivoting to long, mutating once, then pivoting back to wide. Even though it takes just one mutate, it is still verbose and has the added complexity of having to figure out a somewhat mind-boggling pivoting operation.

# Option 2: pivot_longer
d %>% 
  rename_all(~gsub("([1-3])", "_\\1", .)) %>% 
  pivot_longer(everything(),
               names_to=c(".value", "seq"),
               names_sep="_") %>% 
  mutate(status = case_when(is.na(units) ~ "U",
                            TRUE ~ status)) %>% 
  pivot_wider(names_from=seq, values_from=c(status, units),
              names_sep="") %>% 
  unnest()

#> # A tibble: 5 x 6
#>   status1 status2 status3 units1 units2 units3
#>   <chr>   <chr>   <chr>    <int>  <int>  <int>
#> 1 P       F       P            1      1      2
#> 2 U       P       P           NA      2      0
#> 3 F       P       F            0      1      2
#> 4 F       P       U            2      0     NA
#> 5 F       F       P            0      2      0

Okay, now for the across version that I'm trying to figure out:

I use across(matches("^status"), to operate on each of the three status columns. The mutate operation needs to successively check the corresponding units column for each status column. That is, when status1 is being mutated, it needs to use units1 for the is.na() condition, and so on.

My thought was to get the numeric suffix for status using cur_column() and paste that onto "units" to get a string like "units1". But then that needs to be turned into a name for is.na() to operate on. Below are my two failed attempts. (For these examples, I've created new updated columns to compare with the original columns.)

My question is how to do this correctly. Alternatively, is there an easier/better approach than this?

# Option 3: mutate with across [NOT WORKING]

# First try
d %>% 
  mutate(across(matches("^status"), 
                ~case_when(is.na(!!as.name(paste0("units", str_extract(cur_column(), "[1-3]$")))) ~ "U",
                           TRUE ~ .),
                .names="{.col}_upd"))
#> Error: `cur_column()` must only be used inside `across()`.

# Second try
d %>% 
  mutate(across(matches("^status"), 
                ~case_when(is.na(!!rlang::parse_expr(paste0("units", str_extract(cur_column(), "[1-3]$")))) ~ "U",
                           TRUE ~ .),
                .names="{.col}_upd"))
#> Error: `cur_column()` must only be used inside `across()`.

One more thing: In the code above, both of my attempts errored. However, (strangely) this code sometimes runs without error, but gives incorrect output, as shown below.

> d %>% 
+   mutate(across(matches("^status"), 
+                 ~case_when(is.na(!!as.name(paste0("units", str_extract(cur_column(), "[1-3]$")))) ~ "U",
+                            TRUE ~ .),
+                 .names="{.col}_upd"))
# A tibble: 5 x 9
  status1 status2 status3 units1 units2 units3 status1_upd status2_upd status3_upd
  <chr>   <chr>   <chr>    <int>  <int>  <int> <chr>       <chr>       <chr>      
1 P       F       P            1      1      2 P           F           P          
2 P       P       P           NA      2      0 P           P           P          
3 F       P       F            0      1      2 F           P           F          
4 F       P       P            2      0     NA U           U           U          
5 F       F       P            0      2      0 F           F           P          
> 
> d %>% 
+   mutate(across(matches("^status"), 
+                 ~case_when(is.na(!!rlang::parse_expr(paste0("units", str_extract(cur_column(), "[1-3]$")))) ~ "U",
+                            TRUE ~ .),
+                 .names="{.col}_upd"))
# A tibble: 5 x 9
  status1 status2 status3 units1 units2 units3 status1_upd status2_upd status3_upd
  <chr>   <chr>   <chr>    <int>  <int>  <int> <chr>       <chr>       <chr>      
1 P       F       P            1      1      2 P           F           P          
2 P       P       P           NA      2      0 P           P           P          
3 F       P       F            0      1      2 F           P           F          
4 F       P       P            2      0     NA U           U           U          
5 F       F       P            0      2      0 F           F           P

Session Info:

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.5     purrr_0.3.4     readr_1.4.0     tidyr_1.1.2     tibble_3.0.6   
 [8] ggplot2_3.3.3   tidyverse_1.3.0 reprex_1.0.0    testthat_3.0.1  devtools_2.3.2  usethis_2.0.0  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        lubridate_1.7.9.2 prettyunits_1.1.1 ps_1.5.0          assertthat_0.2.1  rprojroot_2.0.2  
 [7] utf8_1.1.4        R6_2.5.0          cellranger_1.1.0  backports_1.2.1   httr_1.4.2        pillar_1.4.7     
[13] rlang_0.4.10      readxl_1.3.1      rstudioapi_0.13   callr_3.5.1       desc_1.2.0        munsell_0.5.0    
[19] broom_0.7.4       compiler_4.0.3    modelr_0.1.8      pkgconfig_2.0.3   pkgbuild_1.2.0    tidyselect_1.1.0 
[25] fansi_0.4.2       crayon_1.4.1      dbplyr_2.1.0      withr_2.4.1       grid_4.0.3        jsonlite_1.7.2   
[31] gtable_0.3.0      lifecycle_1.0.0   DBI_1.1.1         magrittr_2.0.1    scales_1.1.1      cli_2.3.0        
[37] stringi_1.5.3     cachem_1.0.1      fs_1.5.0          remotes_2.2.0     xml2_1.3.2        ellipsis_0.3.1   
[43] generics_0.1.0    vctrs_0.3.6       tools_4.0.3       glue_1.4.2        hms_1.0.0         processx_3.4.5   
[49] pkgload_1.1.0     fastmap_1.1.0     yaml_2.2.1        colorspace_2.0-0  sessioninfo_1.1.1 rvest_0.3.6      
[55] memoise_2.0.0     haven_2.3.1

nirgrahamuk · March 5, 2021, 9:12pm

it does seem like dplyr across is bugging in terms of the cur_column issue. would be nice if it worked as expected !

Here is a non-across alternative, that although is a little complex, has maybe simpler string manipulation and is I think explicit and in some sense easier to follow than the pivot_wider way.

case_exprs <- map_chr(1:3,
          ~glue('case_when(is.na(units{.}) ~ "U",
                        TRUE ~ status{.})')) %>% 
  rlang::parse_exprs() %>% 
  purrr::set_names(nm = glue("status{1:3}"))

d %>% 
  mutate(!!!case_exprs)

joels · March 8, 2021, 4:39pm

Thanks Nir. It's a nice workaround, but, as you say, still a bit complex. I may post this as an issue at the dplyr github site if I don't find a resolution here.

lionel · March 11, 2021, 9:00am

This is expected behaviour. The problem is that when you use !!, evaluation happens very early, before across() is evaluated. It happens when the outermost mutate() call starts running. This is essential to the mechanics of tidy evaluation but it does create confusing evaluation timing in the more complex cases.

In general I'd recommend against this sort of metaprogramming and instead try to reformulate the problem using data structures (just like you did with the pivot-longer solution). One possible way is to use df-columns, i.e. data frames inside data frames.

# Let's gather all pairs of related columns into df-cols.
# I'm not sure how this could be automated based on colnames though.
dd <- d %>% transmute(
  units1 = tibble(status = status1, value = units1),
  units2 = tibble(status = status2, value = units2),
  units3 = tibble(status = status3, value = units3)
)

dd
#> # A tibble: 5 x 3
#>   units1$status $value units2$status $value units3$status $value
#>   <chr>          <int> <chr>          <int> <chr>          <int>
#> 1 P                  1 F                  1 P                  2
#> 2 P                 NA P                  2 P                  0
#> 3 F                  0 P                  1 F                  2
#> 4 F                  2 P                  0 P                 NA
#> 5 F                  0 F                  2 P                  0

As you can see the columns inside the nested data frames all have the same names. This is going to make it easier for us to iterate over each pair of columns and refer to the names.

dd %>%
  mutate(across(
    everything(),
    ~ mutate(
      .x,
      status = if_else(is.na(value), "U", status)
    )
  ))
#> # A tibble: 5 x 3
#>   units1$status $value units2$status $value units3$status $value
#>   <chr>          <int> <chr>          <int> <chr>          <int>
#> 1 P                  1 F                  1 P                  2
#> 2 U                 NA P                  2 P                  0
#> 3 F                  0 P                  1 F                  2
#> 4 F                  2 P                  0 U                 NA
#> 5 F                  0 F                  2 P                  0

It works but now our data is all packed within these df-cols! We can unpack it manually, just like we packed it manually. I wish we had a more automatic solution for these two steps, it feels like we are missing some tools.

PS:

However, (strangely) this code sometimes runs without error, but gives incorrect output

I think this is because of a bug for which I just opened an issue at `cur_column()` isn't reset · Issue #5812 · tidyverse/dplyr · GitHub

system · April 1, 2021, 9:00am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.