How to delete duplicates by two columns and keep the last one based on an id

Khoajak · March 27, 2022, 2:30pm

Hello, I would like to delete the duplicates based on two columns in the below data and keep the last one based on id. If someone can help me, it is highly appreciate it.

data <- data.frame(
        stringsAsFactors = FALSE,
        identifer=c("a","a","a","b","b","b","c","c","c"),
        id = c(1L, 2L, 3L,1L, 2L, 3L,1L, 2L, 3L),
        date = c("2020-01-25","2020-01-25","2020-01-25", "2021-02-10","2021-02-10","2021-02-10", "2021-03-15","2021-03-15","2021-03-15")
        
        )

JackDavison · March 27, 2022, 2:42pm

Is this what you're aiming for?

data <- data.frame(
  stringsAsFactors = FALSE,
  identifer=c("a","a","a","b","b","b","c","c","c"),
  id = c(1L, 2L, 3L,1L, 2L, 3L,1L, 2L, 3L),
  date = c("2020-01-25","2020-01-25","2020-01-25", "2021-02-10","2021-02-10","2021-02-10", "2021-03-15","2021-03-15","2021-03-15")
  
)

dplyr::distinct(data, identifer, date, .keep_all = T)
#>   identifer id       date
#> 1         a  1 2020-01-25
#> 2         b  1 2021-02-10
#> 3         c  1 2021-03-15

^{Created on 2022-03-27 by the reprex package (v2.0.1)}

Khoajak · March 27, 2022, 6:07pm

Thank you Jack for the quick response and help. I would like to keep the obs with id=3 for this data if you can help with that.

JackDavison · March 27, 2022, 7:27pm

Ah so that would look like this:

data <- data.frame(
  stringsAsFactors = FALSE,
  identifer=c("a","a","a","b","b","b","c","c","c"),
  id = c(1L, 2L, 3L,1L, 2L, 3L,1L, 2L, 3L),
  date = c("2020-01-25","2020-01-25","2020-01-25", "2021-02-10","2021-02-10","2021-02-10", "2021-03-15","2021-03-15","2021-03-15")
  
)

dplyr::filter(data, id == 3)
#>   identifer id       date
#> 1         a  3 2020-01-25
#> 2         b  3 2021-02-10
#> 3         c  3 2021-03-15

Khoajak · March 28, 2022, 2:30am

Hi Jack. Thank you again for your help. My actual data has over 1000 obs. I would like to fiter the repeated obs by identifier and date and keep the last obs based on id. Id is different for each identifier and ranges (from 1- 5), but I would like to keep the last obs. I revised the date below that give you better clues.

data <- data.frame(
        stringsAsFactors = FALSE,
        identifer=c("a","a","a","b","b","b","c","c", "d", "d", "d", "d"),
        id = c(1L, 2L, 3L,1L, 2L, 3L,1L, 2L,1L, 2L, 3L,4L ),
        date = c("2020-01-25","2020-01-25","2020-01-25", "2021-02-10","2021-02-10","2021-02-10", "2021-03-15","2021-03-15","2021-03-17","2021-03-17","2021-03-17","2021-03-17")
        
        )

FJCC · March 28, 2022, 3:13am

Is this what you are after?

data <- data.frame(
  stringsAsFactors = FALSE,
  identifer=c("a","a","a","b","b","b","c","c", "d", "d", "d", "d"),
  id = c(1L, 2L, 3L,1L, 2L, 3L,1L, 2L,1L, 2L, 3L,4L ),
  date = c("2020-01-25","2020-01-25","2020-01-25", "2021-02-10","2021-02-10","2021-02-10", "2021-03-15","2021-03-15","2021-03-17","2021-03-17","2021-03-17","2021-03-17")
  
)
data
#>    identifer id       date
#> 1          a  1 2020-01-25
#> 2          a  2 2020-01-25
#> 3          a  3 2020-01-25
#> 4          b  1 2021-02-10
#> 5          b  2 2021-02-10
#> 6          b  3 2021-02-10
#> 7          c  1 2021-03-15
#> 8          c  2 2021-03-15
#> 9          d  1 2021-03-17
#> 10         d  2 2021-03-17
#> 11         d  3 2021-03-17
#> 12         d  4 2021-03-17
library(dplyr)

MaxID <- data %>% group_by(identifer, date) %>% summarize(id = max(id))
#> `summarise()` has grouped output by 'identifer'. You can override using the
#> `.groups` argument.
MaxID
#> # A tibble: 4 × 3
#> # Groups:   identifer [4]
#>   identifer date          id
#>   <chr>     <chr>      <int>
#> 1 a         2020-01-25     3
#> 2 b         2021-02-10     3
#> 3 c         2021-03-15     2
#> 4 d         2021-03-17     4
data2 <- semi_join(data, MaxID, by = c("identifer", "id", "date"))
data2
#>   identifer id       date
#> 1         a  3 2020-01-25
#> 2         b  3 2021-02-10
#> 3         c  2 2021-03-15
#> 4         d  4 2021-03-17

^{Created on 2022-03-27 by the reprex package (v0.2.1)}

Khoajak · March 28, 2022, 3:55am

Thank you very much FJCC and Jack. It is exactly what I want.

system · April 4, 2022, 3:56am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.