de-dup and keep both

tjcnnl1 · July 15, 2022, 9:28pm

I want to split a dataset into two: one containing the duplicated rows and another containing the unique rows. Using distinct(id, .keep_all = TRUE) gives me a dataset with only unique rows, but the duplicated rows are lost. How can I generate the two sets and keep both? Thanks in advance!

employee <- c('John','Peter','Jolie','Hot')
salary <- c(21000, 23400, 26800, 23000)
id <- c(1,2,3,1)
data <- data.frame(employee, salary, id, stringsAsFactors=FALSE)

AlexisW · July 15, 2022, 11:38pm

Can't you just make a copy?

library(dplyr)
# employee, salary and id are the vectors defined in the question
data_with_duplicates <- data.frame(employee, salary, id, stringsAsFactors=FALSE)
data_no_duplicate <- distinct(data_with_duplicates, id, .keep_all = TRUE)

identical(data_with_duplicates, data_no_duplicate)
#> [1] FALSE

Or if the goal is to recover the rows that were removed as duplicates:

anti_join(data_with_duplicates, data_no_duplicate, by = c("employee", "salary", "id"))
#>   employee salary id
#> 1      Hot  23000  1
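
If you would rather not depend on dplyr for this step, the same split can be sketched in base R with duplicated(). This is only a minimal sketch: it assumes "duplicate" means a repeated id (as in the distinct() call above), and the names is_repeat, first_rows and repeat_rows are made up for illustration.

```r
# Same example data as in the question
data <- data.frame(
  employee = c("John", "Peter", "Jolie", "Hot"),
  salary = c(21000, 23400, 26800, 23000),
  id = c(1, 2, 3, 1),
  stringsAsFactors = FALSE
)

# TRUE for every row whose id already appeared in an earlier row
is_repeat <- duplicated(data$id)

# Rows that distinct(id, .keep_all = TRUE) would keep (John, Peter, Jolie)
first_rows <- data[!is_repeat, ]

# Rows that distinct() would drop (Hot)
repeat_rows <- data[is_repeat, ]
```

Every row lands in exactly one of the two data frames, so nothing is lost between them.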

nirgrahamuk · July 18, 2022, 8:45am

#example data
df_ <- data.frame(
  stringsAsFactors = FALSE,
  employee = c("John", "Peter", "Jolie", "Hot"),
  salary = c(21000, 23400, 26800, 23000),
  id = c(1, 2, 3, 1)
)

# solution
library(tidyverse)

(smry_df <- df_  |> group_by(id) |> summarise(dup=n()>1))
(split_df <- smry_df |> split(f =~ dup))
(packed_df <- map(split_df,~left_join(.x,df_,by="id") |> select(-dup)))
# extract the two data frames from the list as separate objects;
# split() orders the elements FALSE, TRUE, so [[1]] holds ids that occur once
(uniq_df <- packed_df[[1]])
(dup_df <- packed_df[[2]])
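
Note that "duplicated" here means every row whose id occurs more than once, so both John and Hot end up in dup_df, not only the later row. If that is the definition you want, a shorter route might be a grouped filter. This is a sketch under that assumption, not necessarily better than the split() approach above:

```r
library(dplyr)

# Same example data as above
df_ <- data.frame(
  stringsAsFactors = FALSE,
  employee = c("John", "Peter", "Jolie", "Hot"),
  salary = c(21000, 23400, 26800, 23000),
  id = c(1, 2, 3, 1)
)

# Rows whose id occurs more than once (John and Hot)
dup_df <- df_ |>
  group_by(id) |>
  filter(n() > 1) |>
  ungroup()

# Rows whose id occurs exactly once (Peter and Jolie)
uniq_df <- df_ |>
  group_by(id) |>
  filter(n() == 1) |>
  ungroup()
```

Unlike the list-indexing version, this does not error when one of the two groups happens to be empty; you simply get a zero-row tibble.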

tjcnnl1 · July 18, 2022, 2:01pm

Yes, the goal is to recover the rows that were removed. Thank you.

mikecrobp · July 20, 2022, 11:35am

How about grouping by all columns (assuming your definition of a duplicate is that all columns are the same)?
The unique row is row number 1 within each group; duplicates are the rows with row number > 1.

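  # df_ is the example data frame from the post above (packed_df there is a list)
  # note: no row of the example data repeats in full, so dup_df comes out empty here
  df = df_ %>%
    group_by_all() %>%
    mutate(rownumber = row_number())

  uniq_df = df %>%
    filter(rownumber == 1)

  dup_df = df %>%
    filter(rownumber > 1)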
