Remove duplicates based on a variable but ignoring NA's

afmannew · September 15, 2021, 6:37pm

I have a dataframe from which I need to remove duplicates based on the variable "e-mail". However, it contains a lot of NA's that I cannot get rid of because they're valuable observations. Besides the NA's, some people entered a dot instead of an address, so I want to know if I can drop the rows with duplicated e-mails while ignoring the NA's and the observations with "." as the email.

I've tried distinct() and n_distinct(), but neither lets me drop the duplicates while skipping the NA's.

Here's an example of what I mean:

library(dplyr)
email <- c("xxx@xxx.xxx","xxx@xxx.xxx","yyy@yyy.yyy","yyy@yyy.yyy","zzz@zzz.zzz","zzz@zzz.zzz",".",".",".",".",".")
names <- c("Gabriel","Marcos","Julio","Rafael","Victor","Azymov","Turkey Sandvich","Marzia","Door","Cato","Doggo")
test <- data.frame(email,names)
morenames <- c("Soap","Redbull","World of Warcraft")
moreemails <- c(NA,NA,NA)
test2 <- data.frame(moreemails, morenames)
names(test2) <- c("email","names")
test <- test %>% rbind(test2)
test
verif_dup <- test[duplicated(test[,1]),]
verif_dup

I can see all the duplicate emails on verif_dup. I want a way to remove the duplicates like xxx@xxx.xxx, yyy@yyy.yyy and zzz@zzz.zzz, but keep the "." and NA's.

HanOostdijk · September 16, 2021, 7:49am

Of course a reprex would help YOU here:

it makes it more convenient for the reader to help you out:
no need to create test data
it shows that you think your issue is important enough that you took the trouble to create a reprex: an incentive for some people to help you

afmannew · September 20, 2021, 5:43pm

Edited the topic with some fictional data to represent what I mean.

HanOostdijk · September 20, 2021, 8:01pm

For this data you could do the following:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
email <- c("xxx@xxx.xxx","xxx@xxx.xxx","yyy@yyy.yyy","yyy@yyy.yyy",
           "zzz@zzz.zzz","zzz@zzz.zzz",".",".",".",".",".")
names <- c("Gabriel","Marcos","Julio","Rafael","Victor","Azymov",
           "Turkey Sandvich","Marzia","Door","Cato","Doggo")
test <- data.frame(email,names)
morenames <- c("Soap","Redbull","World of Warcraft")
moreemails <- c(NA,NA,NA)
test2 <- data.frame(moreemails, morenames)
names(test2) <- c("email","names")
test <- test %>% rbind(test2)

dfkeep <- test %>%
  group_by(email) %>%
  mutate(keep = case_when(
    is.na(email) ~ TRUE,      # keeps all NA
    email == "." ~ TRUE,      # keeps all character '.' (missing in some languages)
    row_number() == 1 ~ TRUE, # keeps the first row of each (other) duplicated email
    TRUE ~ FALSE              # drops the remaining duplicates
  ))
print(dfkeep)
#> # A tibble: 14 x 3
#> # Groups:   email [5]
#>    email       names             keep 
#>    <chr>       <chr>             <lgl>
#>  1 xxx@xxx.xxx Gabriel           TRUE 
#>  2 xxx@xxx.xxx Marcos            FALSE
#>  3 yyy@yyy.yyy Julio             TRUE 
#>  4 yyy@yyy.yyy Rafael            FALSE
#>  5 zzz@zzz.zzz Victor            TRUE 
#>  6 zzz@zzz.zzz Azymov            FALSE
#>  7 .           Turkey Sandvich   TRUE 
#>  8 .           Marzia            TRUE 
#>  9 .           Door              TRUE 
#> 10 .           Cato              TRUE 
#> 11 .           Doggo             TRUE 
#> 12 <NA>        Soap              TRUE 
#> 13 <NA>        Redbull           TRUE 
#> 14 <NA>        World of Warcraft TRUE
dfkeep %>%
  filter(keep) %>%
  select(-keep) %>%
  ungroup()
#> # A tibble: 11 x 2
#>    email       names            
#>    <chr>       <chr>            
#>  1 xxx@xxx.xxx Gabriel          
#>  2 yyy@yyy.yyy Julio            
#>  3 zzz@zzz.zzz Victor           
#>  4 .           Turkey Sandvich  
#>  5 .           Marzia           
#>  6 .           Door             
#>  7 .           Cato             
#>  8 .           Doggo            
#>  9 <NA>        Soap             
#> 10 <NA>        Redbull          
#> 11 <NA>        World of Warcraft
Created on 2021-09-20 by the reprex package (v2.0.0)
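
If the intermediate keep column is not needed, the same rule can probably be written as a single filter. The following is only a sketch under a few assumptions: the data is the fictional test data from above (rebuilt here so the block runs on its own), "." is the only placeholder value, and the names exempt, deduped and deduped_base are made up for this illustration. It relies on duplicated() marking every occurrence after the first, so negating it keeps the first row of each real e-mail, while the two other conditions let every NA and "." row through.

```r
library(dplyr)

# Same fictional data as in the reprex above, built in one step
test <- data.frame(
  email = c("xxx@xxx.xxx", "xxx@xxx.xxx", "yyy@yyy.yyy", "yyy@yyy.yyy",
            "zzz@zzz.zzz", "zzz@zzz.zzz", ".", ".", ".", ".", ".",
            NA, NA, NA),
  names = c("Gabriel", "Marcos", "Julio", "Rafael", "Victor", "Azymov",
            "Turkey Sandvich", "Marzia", "Door", "Cato", "Doggo",
            "Soap", "Redbull", "World of Warcraft")
)

# Placeholder values that should never count as duplicates
# (assumption: only "." occurs; extend the vector if needed)
exempt <- c(".")

# Keep a row if its email is NA, is a placeholder,
# or is the first occurrence of that email
deduped <- test %>%
  filter(is.na(email) | email %in% exempt | !duplicated(email))

# Base R version of the same condition, without dplyr
deduped_base <- test[is.na(test$email) | test$email %in% exempt |
                       !duplicated(test$email), ]

print(deduped)
```

Because is.na() is checked first and %in% never returns NA, the condition is always TRUE or FALSE, so filter() does not silently drop the NA rows. Unlike the group_by() approach, this version needs no ungroup() afterwards.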
