I have a data frame from which I need to remove duplicates based on the variable "email". However, many of its values are NA, and I can't drop those rows because they are valuable observations. Besides the NAs, some people entered a dot (".") instead of an address. Is there a way to remove the rows with duplicated e-mails while ignoring the rows where the e-mail is NA or "."?
I've tried distinct() and n_distinct(). distinct() has no na.rm option, and although n_distinct() does accept na.rm, it only counts unique values instead of removing rows.
Here's an example of what I mean:
library(dplyr)
email <- c("xxx@xxx.xxx","xxx@xxx.xxx","yyy@yyy.yyy","yyy@yyy.yyy","zzz@zzz.zzz","zzz@zzz.zzz",".",".",".",".",".")
names <- c("Gabriel","Marcos","Julio","Rafael","Victor","Azymov","Turkey Sandvich","Marzia","Door","Cato","Doggo")
test <- data.frame(email,names)
morenames <- c("Soap","Redbull","World of Warcraft")
moreemails <- c(NA,NA,NA)
test2 <- data.frame(moreemails, morenames)
names(test2) <- c("email","names")
test <- test %>% rbind(test2)
test
verif_dup <- test[duplicated(test[,1]),]
verif_dup
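verif_dup shows every row that duplicated() flags, which includes the repeated "." and NA entries as well as the real duplicates. I want a way to remove the duplicated rows for real addresses such as xxx@xxx.xxx, yyy@yyy.yyy and zzz@zzz.zzz, but keep every row where the e-mail is "." or NA.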
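One possible approach, sketched here under the assumption that "keep the first occurrence of each real address" is the desired behaviour: build a logical mask that keeps a row if its e-mail is not a duplicate, or if it is NA or ".". The name deduped is only illustrative and does not appear in the example above.

```r
library(dplyr)

# Same data as in the example above, built in one step.
test <- data.frame(
  email = c("xxx@xxx.xxx", "xxx@xxx.xxx", "yyy@yyy.yyy", "yyy@yyy.yyy",
            "zzz@zzz.zzz", "zzz@zzz.zzz", ".", ".", ".", ".", ".",
            NA, NA, NA),
  names = c("Gabriel", "Marcos", "Julio", "Rafael", "Victor", "Azymov",
            "Turkey Sandvich", "Marzia", "Door", "Cato", "Doggo",
            "Soap", "Redbull", "World of Warcraft")
)

# Keep a row when its e-mail is NA, is ".", or is the first occurrence
# of that address. is.na() comes first so the NA produced by
# `email == "."` on missing values never decides the outcome.
deduped <- test %>%
  filter(is.na(email) | email == "." | !duplicated(email))

deduped
```

With this mask, the rows for Marcos, Rafael and Azymov are dropped, while all five "." rows and all three NA rows stay. If the last occurrence should win instead, duplicated(email, fromLast = TRUE) is the usual switch.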