How to delete duplicates by two columns and keep the last one based on an id

Hello, I would like to delete the duplicates based on two columns in the below data and keep the last one based on id. If someone can help me, it is highly appreciate it.

data <- data.frame(
        stringsAsFactors = FALSE,
        identifer=c("a","a","a","b","b","b","c","c","c"),
        id = c(1L, 2L, 3L,1L, 2L, 3L,1L, 2L, 3L),
        date = c("2020-01-25","2020-01-25","2020-01-25", "2021-02-10","2021-02-10","2021-02-10", "2021-03-15","2021-03-15","2021-03-15")
        
        )

Is this what you're aiming for?

data <- data.frame(
  stringsAsFactors = FALSE,
  identifer=c("a","a","a","b","b","b","c","c","c"),
  id = c(1L, 2L, 3L,1L, 2L, 3L,1L, 2L, 3L),
  date = c("2020-01-25","2020-01-25","2020-01-25", "2021-02-10","2021-02-10","2021-02-10", "2021-03-15","2021-03-15","2021-03-15")
  
)

dplyr::distinct(data, identifer, date, .keep_all = T)
#>   identifer id       date
#> 1         a  1 2020-01-25
#> 2         b  1 2021-02-10
#> 3         c  1 2021-03-15

Created on 2022-03-27 by the reprex package (v2.0.1)

Thank you Jack for the quick response and help. I would like to keep the obs with id=3 for this data if you can help with that.

Ah so that would look like this:

data <- data.frame(
  stringsAsFactors = FALSE,
  identifer=c("a","a","a","b","b","b","c","c","c"),
  id = c(1L, 2L, 3L,1L, 2L, 3L,1L, 2L, 3L),
  date = c("2020-01-25","2020-01-25","2020-01-25", "2021-02-10","2021-02-10","2021-02-10", "2021-03-15","2021-03-15","2021-03-15")
  
)

dplyr::filter(data, id == 3)
#>   identifer id       date
#> 1         a  3 2020-01-25
#> 2         b  3 2021-02-10
#> 3         c  3 2021-03-15

Hi Jack. Thank you again for your help. My actual data has over 1000 obs. I would like to fiter the repeated obs by identifier and date and keep the last obs based on id. Id is different for each identifier and ranges (from 1- 5), but I would like to keep the last obs. I revised the date below that give you better clues.

data <- data.frame(
        stringsAsFactors = FALSE,
        identifer=c("a","a","a","b","b","b","c","c", "d", "d", "d", "d"),
        id = c(1L, 2L, 3L,1L, 2L, 3L,1L, 2L,1L, 2L, 3L,4L ),
        date = c("2020-01-25","2020-01-25","2020-01-25", "2021-02-10","2021-02-10","2021-02-10", "2021-03-15","2021-03-15","2021-03-17","2021-03-17","2021-03-17","2021-03-17")
        
        )

Is this what you are after?

data <- data.frame(
  stringsAsFactors = FALSE,
  identifer=c("a","a","a","b","b","b","c","c", "d", "d", "d", "d"),
  id = c(1L, 2L, 3L,1L, 2L, 3L,1L, 2L,1L, 2L, 3L,4L ),
  date = c("2020-01-25","2020-01-25","2020-01-25", "2021-02-10","2021-02-10","2021-02-10", "2021-03-15","2021-03-15","2021-03-17","2021-03-17","2021-03-17","2021-03-17")
  
)
data
#>    identifer id       date
#> 1          a  1 2020-01-25
#> 2          a  2 2020-01-25
#> 3          a  3 2020-01-25
#> 4          b  1 2021-02-10
#> 5          b  2 2021-02-10
#> 6          b  3 2021-02-10
#> 7          c  1 2021-03-15
#> 8          c  2 2021-03-15
#> 9          d  1 2021-03-17
#> 10         d  2 2021-03-17
#> 11         d  3 2021-03-17
#> 12         d  4 2021-03-17
library(dplyr)

MaxID <- data %>% group_by(identifer, date) %>% summarize(id = max(id))
#> `summarise()` has grouped output by 'identifer'. You can override using the
#> `.groups` argument.
MaxID
#> # A tibble: 4 × 3
#> # Groups:   identifer [4]
#>   identifer date          id
#>   <chr>     <chr>      <int>
#> 1 a         2020-01-25     3
#> 2 b         2021-02-10     3
#> 3 c         2021-03-15     2
#> 4 d         2021-03-17     4
data2 <- semi_join(data, MaxID, by = c("identifer", "id", "date"))
data2
#>   identifer id       date
#> 1         a  3 2020-01-25
#> 2         b  3 2021-02-10
#> 3         c  2 2021-03-15
#> 4         d  4 2021-03-17

Created on 2022-03-27 by the reprex package (v0.2.1)

1 Like

Thank you very much FJCC and Jack. It is exactly what I want.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.