I'm creating a data frame and need to delete every row in which at least two columns have the same content (text). Empty cells (NA) shouldn't count as duplicates. For example, in the following data frame, I would need to remove only the first and second rows.
But I have more than 10,000 rows, so I need code that detects the rows where some cells have the same content and deletes them. How could I do that?
Another solution could be to concatenate the contents of all 25 columns into one cell per row and ask R to delete the rows where the string in that cell contains the same name twice.
I hope I have been as clear as possible; if not, please ask me for clarification.
Perhaps something like this, though it might not be practical if you have lots of columns.
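library(dplyr)

df %>%
  as_tibble() %>%
  # flag rows where any pair of columns holds the same value
  mutate(duplicates = if_else(A == B | A == C | B == C,
                              TRUE,
                              FALSE)) %>%
  # keep only unflagged rows; filter() also drops rows where the flag is NA
  filter(duplicates == FALSE)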
# A tibble: 2 x 4
  A     B     C     duplicates
  <chr> <chr> <chr> <lgl>
1 c     d     a     FALSE
2 d     e     f     FALSE
It works, thank you. The problem is that some cells are empty (NA), and it treats those as duplicates. Moreover, even though I have many columns, I wrote out all the combinations by hand, because my attempt with nested for loops didn't work. Would you have an idea of how to solve those problems too? Sorry if I'm being insistent, I'm new to RStudio.
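One way to handle both points (NA cells and many columns) might be to check each row for repeated non-NA values instead of comparing columns pair by pair. Below is a minimal sketch in base R, assuming every column is character (or can safely be compared as text). The sample df is made up for illustration and is not the data from the question.

```r
# Hypothetical sample data; column names A, B, C mirror the answer above.
df <- data.frame(
  A = c("a", "b", "c", "d", NA),
  B = c("a", "c", "d", "e", NA),
  C = c("b", "b", "a", NA,  "x"),
  stringsAsFactors = FALSE
)

# For each row, drop the NA cells and test whether any remaining value repeats.
# anyDuplicated() returns 0 when there is no duplicate, so "> 0" means "has one".
has_dup <- apply(df, 1, function(row) anyDuplicated(row[!is.na(row)]) > 0)

# Keep only the rows without repeated non-NA values.
result <- df[!has_dup, , drop = FALSE]
print(result)
```

With this sample data, rows 1 and 2 are removed, while the rows containing NA are kept because NA cells are ignored. The same code should work unchanged for 25 columns, since nothing in it refers to column names. Note that apply() converts the data frame to a matrix first, so columns of mixed types would be compared as text.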