I have written a function to "deduplicate" a dataframe.
A unique row is based on having a unique combination of one of more ID (ie key) fields
Deciding which duplicate to keep is based on taking the row with the maximum of a single field.
Example below.
But what what if the maximum also needed to be a "composite" - so I can specify the row to keep as the one with a maximum based on multiple fields?
I can see that the new "by" functionality is by the joins in dplyr 1.1.0 might be a way in. But can't see how I would use them.
library(tidyverse)
DeDupByMax = function(dfin , maxfield, ... ) {
if (missing(...)) {
cat("DeDupByMax unique IDs missing\n")
stop()
}
quoID <- enquos(...)
maxfieldquo <- enquo(maxfield)
dfout <- dfin %>%
arrange(!!!quoID, desc(!!maxfieldquo)) %>%
distinct(!!!quoID, .keep_all = TRUE)
}
TodayDate = Sys.Date()
df = tibble(unique = c(1,1,2,3), other = 1:4, datefield = c(TodayDate, TodayDate - 1, TodayDate, TodayDate))
print(df)
#> # A tibble: 4 × 3
#> unique other datefield
#> <dbl> <int> <date>
#> 1 1 1 2023-02-24
#> 2 1 2 2023-02-23
#> 3 2 3 2023-02-24
#> 4 3 4 2023-02-24
dfdedup = DeDupByMax(df, datefield, unique)
print(dfdedup)
#> # A tibble: 3 × 3
#> unique other datefield
#> <dbl> <int> <date>
#> 1 1 1 2023-02-24
#> 2 2 3 2023-02-24
#> 3 3 4 2023-02-24
Created on 2023-02-24 with reprex v2.0.2