Remove duplicates based on pairs

john.smith · August 1, 2019, 4:06pm

Hi,

I have a data-frame with 300k rows i wish to dedup.
A duplicate is considered based on a pair. So for example in the below, I would only want the first instance of the duplicate. In this case A-C is the same thing as C-A. Does anyone know a way of identifying this?

library(tidyverse)
mydf <- tibble::tribble(
  ~Col_A, ~Col_B,
     "A",    "C",
     "B",    "B",
     "A",    "C",
     "A",    "C",
     "C",    "A"
  )

My final result here should be something like

A-C
B-B

Thanks very much for your time
John

mishabalyasin · August 1, 2019, 4:40pm

Well, you need to normalize and since you are working with letters, you can sort it when creating a key. So, something like this would work:

library(tidyverse)
mydf <- tibble::tribble(
  ~Col_A, ~Col_B,
  "A",    "C",
  "B",    "B",
  "A",    "C",
  "A",    "C",
  "C",    "A"
) %>%
  dplyr::mutate(normalized = purrr::map2_chr(Col_A, Col_B, ~paste(sort(c(.x, .y)), collapse = ""))) %>%
  dplyr::group_by(normalized) %>%
  dplyr::summarise(Col_A = dplyr::first(Col_A),
                   Col_B = dplyr::first(Col_B)) %>%
  dplyr::select(-normalized)

mydf
#> # A tibble: 2 x 2
#>   Col_A Col_B
#>   <chr> <chr>
#> 1 A     C    
#> 2 B     B

^{Created on 2019-08-01 by the reprex package (v0.3.0)}

With 300k rows it still won't be too slow, I think.

john.smith · August 1, 2019, 4:43pm

This is very clever, Thank you very much

system · August 8, 2019, 4:44pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.