Hi,
I have a data-frame with 300k rows i wish to dedup.
A duplicate is considered based on a pair. So for example in the below, I would only want the first instance of the duplicate. In this case A-C is the same thing as C-A. Does anyone know a way of identifying this?
library(tidyverse)
mydf <- tibble::tribble(
~Col_A, ~Col_B,
"A", "C",
"B", "B",
"A", "C",
"A", "C",
"C", "A"
)
My final result here should be something like
A-C
B-B
Thanks very much for your time
John
Well, you need to normalize and since you are working with letters, you can sort it when creating a key. So, something like this would work:
library(tidyverse)
mydf <- tibble::tribble(
~Col_A, ~Col_B,
"A", "C",
"B", "B",
"A", "C",
"A", "C",
"C", "A"
) %>%
dplyr::mutate(normalized = purrr::map2_chr(Col_A, Col_B, ~paste(sort(c(.x, .y)), collapse = ""))) %>%
dplyr::group_by(normalized) %>%
dplyr::summarise(Col_A = dplyr::first(Col_A),
Col_B = dplyr::first(Col_B)) %>%
dplyr::select(-normalized)
mydf
#> # A tibble: 2 x 2
#> Col_A Col_B
#> <chr> <chr>
#> 1 A C
#> 2 B B
Created on 2019-08-01 by the reprex package (v0.3.0)
With 300k rows it still won't be too slow, I think.
2 Likes
mishabalyasin:
library(tidyverse) mydf <- tibble::tribble( ~Col_A, ~Col_B, "A", "C", "B", "B", "A", "C", "A", "C", "C", "A" ) %>% dplyr::mutate(normalized = purrr::map2_chr(Col_A, Col_B, ~paste(sort(c(.x, .y)), collapse = ""))) %>% dplyr::group_by(normalized) %>% dplyr::summarise(Col_A = dplyr::first(Col_A), Col_B = dplyr::first(Col_B)) %>% dplyr::select(-normalized) mydf
This is very clever, Thank you very much
system
Closed
August 8, 2019, 4:44pm
4
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.