I want to find the primary key (unique row identifier) within a single table, using the tidyverse.
My current method is to use dplyr::add_count()
on an accumulating number of column names, and then run filter(..., n > 1)
to discover results which yield zero rows.
library(dplyr)
library(magrittr)
set.seed(101)
# Test data set
dupes <- tibble(id = sample(letters[1:5], 20, TRUE),
date = Sys.Date() - sample(0:3, 20, TRUE),
category = sample(LETTERS[1:5], 20, TRUE),
uni = 1:20)
# This is the gist of the idea
dupes %>%
add_count(id) %>%
filter(n > 1)
# The tidy way
dupes %>%
names() %>%
accumulate(c) %>%
map(syms) %>%
map_lgl(~{
dupes %>%
add_count(!!!.x) %>%
filter(.data$n > 1) %>%
nrow() %>%
equals(0)
}) %>%
which() # this tells me how many combinations it takes to find the key
Is there an automagic way of doing this? My method does not create all possible combinations of column names, and is only a rough guess.
For datasets with a large number of columns, running such a script on all possible combinations of column names would take a considerably large amount of time to run.