Detect and delate rows where at least two columns have the same content

snoozecj · August 28, 2021, 8:35pm

Hi All,

I'm creating a data frame and I need to delete all the rows where at least two columns have the same content (text). Empty cells ( NA ) shouldn't be considered duplicates. For example, in the following data frame, I would need to cancel only the first and the second rows.

df = cbind(A = c('a', 'b', 'c', 'd','NA'),
                 B = c('a', 'c', 'd', 'e','g'), C = c('e', 'b', 'a', 'f','NA'))

df <- data.frame (df)

But I have more than 10'000 rows, therefore I would need to find a code that allows me to detect the rows where some cells have the same contents and delete them. How could I do?

Another solution could be to concatenate all the 25 columns contents in one cell (per row) and ask R to delate the rows where the string in that cell has a name repeated twice.

Hope to have been the clearer as possible, in case ask me for clarification.

Many thanks

williaml · August 28, 2021, 10:13pm

Perhaps something like this, though might not be that practical if you have lots of columns.

library(dplyr)
df %>% 
  as_tibble() %>% 
  mutate(duplicates = if_else(A == B | A == C | B == C,
                              TRUE,
                              FALSE)) %>% 
  filter(duplicates == FALSE)

# A tibble: 2 x 4
  A     B     C     duplicates
  <chr> <chr> <chr> <lgl>     
1 c     d     a     FALSE     
2 d     e     f     FALSE

technocrat · August 28, 2021, 10:26pm

Be sure to create a data frame or tibble, not just use cbind.

snoozecj · August 29, 2021, 12:37am

It works, thank you. The problem is some cells are empty (NA), therefore it tooks those as duplicates. Moreover, even if I had several columns I created all the combinations by hand because I was trying with nested for cycles but it didn't work. Would you have an idea of how to resolve those problems too? Sorry if I'm insistent, I'm an RStudio naif

nirgrahamuk · August 29, 2021, 8:16am

It would be best if your example reflected your challenge in this way. Could you update it?

nirgrahamuk · August 29, 2021, 12:24pm


df <- data.frame (A = c('a', 'b', 'c', 'd',NA),
                  B = c('a', 'c', 'd', 'e','g'), C = c('e', 'b', 'a', 'f',NA))

#get vector of names of variables in df

(nmv <- names(df))

#get pairs of names
(name_pairs <- combn(nmv,2,simplify = FALSE))

#build list of filter comparisons
library(purrr)
library(glue)
library(rlang)
(filters_1 <- map_chr(name_pairs,~glue("({.x[1]}!={.x[2]} | is.na({.x[1]}))")))
#combin into a single filter rule
filters_2 <- parse_exprs(paste0(filters_1 , collapse = " & "))[[1]]

#apply the rule
df %>% filter(eval(filters_2))

snoozecj · August 29, 2021, 12:37pm

But in this way, I'm going to lose the rows with NAs, while I want to keep them if there are no other duplicates

snoozecj · August 29, 2021, 12:40pm

Sorry, my bad. I didn't notice I wrote the opposite. By the way, empty cells (NA) shouldn't be considered as duplicates for the aim of my research.

nirgrahamuk · August 29, 2021, 1:03pm

ok, i think i understood that I was just sloppy.
extend the inner rules like so:

(filters_1 <- map_chr(name_pairs,~glue("({.x[1]}!={.x[2]} | is.na({.x[1]}) | is.na({.x[2]}))")))

system · September 5, 2021, 1:04pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.