How to pass 2 groups of columns to a function

mikecrobp · February 24, 2023, 9:59am

I have written a function to "deduplicate" a dataframe.
A unique row is based on having a unique combination of one of more ID (ie key) fields
Deciding which duplicate to keep is based on taking the row with the maximum of a single field.
Example below.

But what what if the maximum also needed to be a "composite" - so I can specify the row to keep as the one with a maximum based on multiple fields?

I can see that the new "by" functionality is by the joins in dplyr 1.1.0 might be a way in. But can't see how I would use them.

library(tidyverse)

DeDupByMax = function(dfin , maxfield, ... ) {
  
  if (missing(...)) {
    cat("DeDupByMax unique IDs missing\n")
    stop()
  }
  quoID <- enquos(...)
  maxfieldquo <- enquo(maxfield)
  
  dfout <- dfin %>%
    arrange(!!!quoID, desc(!!maxfieldquo)) %>%
    distinct(!!!quoID, .keep_all = TRUE)
}

TodayDate = Sys.Date()

df = tibble(unique = c(1,1,2,3), other = 1:4, datefield = c(TodayDate, TodayDate - 1, TodayDate, TodayDate))

print(df)
#> # A tibble: 4 × 3
#>   unique other datefield 
#>    <dbl> <int> <date>    
#> 1      1     1 2023-02-24
#> 2      1     2 2023-02-23
#> 3      2     3 2023-02-24
#> 4      3     4 2023-02-24

dfdedup = DeDupByMax(df, datefield, unique)

print(dfdedup)
#> # A tibble: 3 × 3
#>   unique other datefield 
#>    <dbl> <int> <date>    
#> 1      1     1 2023-02-24
#> 2      2     3 2023-02-24
#> 3      3     4 2023-02-24

^{Created on 2023-02-24 with reprex v2.0.2}

pieterjanvc · February 24, 2023, 12:35pm

Hello,

The way you might want to think about this differently is as a group_by() problem.
Here is an example

library(tidyverse)

set.seed(1) #Only needed for reproducibility 
df = data.frame(
  key1 = c(1,2),
  key2 = c(1,2,3,4),
  val1 = runif(12),
  val2 = runif(12)
)
df
#>    key1 key2       val1      val2
#> 1     1    1 0.26550866 0.6870228
#> 2     2    2 0.37212390 0.3841037
#> 3     1    3 0.57285336 0.7698414
#> 4     2    4 0.90820779 0.4976992
#> 5     1    1 0.20168193 0.7176185
#> 6     2    2 0.89838968 0.9919061
#> 7     1    3 0.94467527 0.3800352
#> 8     2    4 0.66079779 0.7774452
#> 9     1    1 0.62911404 0.9347052
#> 10    2    2 0.06178627 0.2121425
#> 11    1    3 0.20597457 0.6516738
#> 12    2    4 0.17655675 0.1255551

#Filter by multiple values sequentially
df %>% group_by(key1, key2) %>% 
  filter(val1 == max(val1)) %>% 
  filter(val2 == max(val2)) %>% 
  ungroup() %>% distinct()
#> # A tibble: 4 × 4
#>    key1  key2  val1  val2
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     2     4 0.908 0.498
#> 2     2     2 0.898 0.992
#> 3     1     3 0.945 0.380
#> 4     1     1 0.629 0.935

#Filter by composite value (no keys removed)
df %>% group_by(key1, key2) %>% 
  filter((val1 + val2) == max(val1 + val2)) %>% 
  ungroup() %>% distinct()
#> # A tibble: 4 × 4
#>    key1  key2  val1  val2
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     3 0.573 0.770
#> 2     2     2 0.898 0.992
#> 3     2     4 0.661 0.777
#> 4     1     1 0.629 0.935

#Filter by composite value (keys removed)
df %>% group_by(key1, key2) %>% 
  filter(val1 == max(val1) & val2 == max(val2)) %>% 
  ungroup() %>% distinct()
#> # A tibble: 2 × 4
#>    key1  key2  val1  val2
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     2     2 0.898 0.992
#> 2     1     1 0.629 0.935

^{Created on 2023-02-24 by the reprex package (v2.0.1)}

As shown in the example above, when you try to design a filtering system that will always return a row for each unique key, you have to be careful what composite logic you use (maximum based on multiple fields as you give as an example).

Hope this helps,
PJ

mikecrobp · February 24, 2023, 1:29pm

Funnily enough, my first version of this function was written in terms of group_by and filter. But:
arrange/distinct was much faster
I think it struggled with NAs in a way that arrange/distinct does not

But actually my question is less about the method so much as how to pass 2 groups of columns into a function. inner_join etc have join_by() and reframe has .by. I suspect I need to be able to call

DeDupByMax(dfin , maxfields = .by(c1,c2,c3), IDfields = .by(c4,c5,c6))

But how do I unravel the .by arguments in the function>

Mike

pieterjanvc · February 24, 2023, 3:54pm

Hi,

I suggest you look more into the dplyr data masking as this will help you using more recent syntax.

Here is an example of passing multiple variables at once:

library(tidyverse)

set.seed(1) #Only needed for reproducibility 
df = data.frame(
  key1 = c(1,2),
  key2 = c(1,2,3,4),
  val1 = runif(12),
  val2 = runif(12)
)

myFun = function(df, gb, ar){
  df %>% group_by(across({{gb}})) %>% 
    summarise(n = n(), .groups = "drop") %>% 
    arrange(across({{ar}}))
}

myFun(df, c(key1, key2), c(key2, key1))
#> # A tibble: 4 × 3
#>    key1  key2     n
#>   <dbl> <dbl> <int>
#> 1     1     1     3
#> 2     2     2     3
#> 3     1     3     3
#> 4     2     4     3

^{Created on 2023-02-24 by the reprex package (v2.0.1)}

Note that this is only possible if the vector of col names does not contain any functions. For example, if you wanted to sort by desc(key2), key1 you would have to do:

myFun = function(df, gb, ...){
  df %>% group_by(across({{gb}})) %>% 
    summarise(n = n(), .groups = "drop") %>% 
    arrange(...)
}

myFun(df, c(key1, key2), desc(key2), key1)
#> # A tibble: 4 × 3
#>    key1  key2     n
#>   <dbl> <dbl> <int>
#> 1     2     4     3
#> 2     1     3     3
#> 3     2     2     3
#> 4     1     1     3

^{Created on 2023-02-24 by the reprex package (v2.0.1)}

I'm still not good at this myself, but at least it might point you in the right direction

Hope this helps,
PJ

mikecrobp · March 3, 2023, 3:23pm

Just to close the loop - I found and documented the answer here Pass orderby to dplyr arrange - #5 by mikecrobp

system · March 10, 2023, 3:23pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.