If there are duplicates in a df, keep one according to a rule

mribeirodantas · April 12, 2020, 5:08am

Let's say I have the following dataframe:

users <- data.frame(name = c('John', 'John', 'Bob'),
                    age = c(18, 18, 28),
                    country = c('Brazil', 'Brazil', 'US'),
                    Grade = c('A', 'B', 'C'))

If I run the code below, only the first and third rows are kept.

library(dplyr)
users %>%
  distinct(name, age, country, .keep_all = TRUE)

However, I would like to keep the second John instead. Whenever there is a duplicate, the row with the lower grade should be chosen, or perhaps the row whose Grade column contains a given substring. How can I do this in a tidyverse way?

dan_miller · April 12, 2020, 6:52am

In essence, you just need to group_by the variables that define a duplicate (in your example 'name' is enough; add 'age' and 'country' if names alone could collide) and then filter on the variable that decides which row to keep. So for your example:

users <- data.frame(name = c('John', 'John', 'Bob'),
                    age = c(18, 18, 28),
                    country = c('Brazil', 'Brazil', 'US'),
                    Grade = c('A', 'B', 'C'))

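library(dplyr)
users %>%
  group_by(name) %>%
  # as.character() covers factor columns; max() then compares alphabetically, so 'B' beats 'A'
  filter(as.character(Grade) == max(as.character(Grade)))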

For your second example, matching a substring, you can use str_detect() from stringr inside filter() instead.
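
A minimal sketch of that substring variant, under a few assumptions of mine that are not in the question: the pattern "B" is just a placeholder, duplicates are defined by name, age and country (as in the distinct() call), and rows that have no duplicate are kept regardless of their Grade.

```r
library(dplyr)
library(stringr)

users <- data.frame(name = c('John', 'John', 'Bob'),
                    age = c(18, 18, 28),
                    country = c('Brazil', 'Brazil', 'US'),
                    Grade = c('A', 'B', 'C'))

# Hypothetical rule: among duplicates, keep the row whose Grade contains "B"
pattern <- "B"

deduped <- users %>%
  group_by(name, age, country) %>%
  # n() == 1 keeps rows that have no duplicate, whatever their Grade;
  # fixed() treats the pattern as a literal string rather than a regex
  filter(n() == 1 | str_detect(as.character(Grade), fixed(pattern))) %>%
  ungroup()

deduped
```

Note that with this rule a duplicated group is dropped entirely if none of its rows match the pattern, and every matching row is kept if several do, so you may want to add a tie-breaker.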
