Find & Identify

I need to search a data.frame for duplicated entries. Here a duplicate is a name with its two parts swapped: for example, "AAAA_AAAB" and "AAAB_AAAA" count as duplicates of each other. When a duplicated name is found, I need the code to identify the matching rows and categorize them as "Confronted_traffic". To illustrate the issue, here is a short reprex:

name <- data.frame(stringsAsFactors = FALSE,
                   name = c("AAAA_AAAB", "AAAC_AAAD", "AAAD_AAAE",
                            "AAAB_AAAA", "AAAD_AAAC", "AAAE_AAAD",
                            "AAAB_AAAA")
)
# To simplify data management, I replace the names with numbers
name_ID <- data.frame(stringsAsFactors = FALSE,
                      name_ID = c(1, 2, 3, 4, 5, 6, 7)
)
solution <- data.frame(stringsAsFactors = FALSE,
                       name_ID = c(1, 1, 2, 3, 4, 4, 5, 6, 7, 7),
                       Confronted_Traffic = c(4, 7, 5, 6, 1, 7, 2, 3, 1, 4)
)
As the example shows, I first replace the names with numbers to simplify data management, since I have around 96000 rows. Then I identify the duplicated data. For name_ID 1, duplicated data was found twice, in name_ID 4 and name_ID 7 (that is why name_ID 1 appears in both the first and second rows, so that each row pairs it with one of its duplicates). For name_ID 2, there is only one duplicate, name_ID 5; for name_ID 3, the duplicate is name_ID 6. For name_ID 4 and name_ID 7, the duplicates are the same set already found for name_ID 1, and the same applies to name_ID 5 and name_ID 6.

This would be one way to do it:

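df <- data.frame(stringsAsFactors = FALSE,
                 name = c("AAAA_AAAB", "AAAC_AAAD", "AAAD_AAAE",
                          "AAAB_AAAA", "AAAD_AAAC", "AAAE_AAAD",
                          "AAAB_AAAA")
)

library(tidyverse)

df %>% 
    # Split each name into its two halves, keeping the original column
    separate(name, c("first", "second"), remove = FALSE) %>% 
    # For each row, find the positions of names matching either ordering of the halves
    mutate(name_ID = row_number(),
           Confronted_Traffic = map2(first, second, ~str_which(name, paste0(.x, "_", .y,"|", .y, "_", .x)))) %>% 
    # Expand the list column to one row per match
    unnest(Confronted_Traffic) %>%
    # Drop each row's match with itself
    filter(name_ID != Confronted_Traffic) %>% 
    select(-first, -second)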
#> # A tibble: 10 x 3
#>    name      name_ID Confronted_Traffic
#>    <chr>       <int>              <int>
#>  1 AAAA_AAAB       1                  4
#>  2 AAAA_AAAB       1                  7
#>  3 AAAC_AAAD       2                  5
#>  4 AAAD_AAAE       3                  6
#>  5 AAAB_AAAA       4                  1
#>  6 AAAB_AAAA       4                  7
#>  7 AAAD_AAAC       5                  2
#>  8 AAAE_AAAD       6                  3
#>  9 AAAB_AAAA       7                  1
#> 10 AAAB_AAAA       7                  4

Created on 2019-11-06 by the reprex package (v0.3.0.9000)
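
Since the real data has around 96000 rows, running one regex search per row over the whole column may get slow. A possible alternative, sketched here in base R only, is to build an order-independent key for each name (the two halves sorted and pasted back together) and join the table to itself on that key. The `key` column and the `left`, `right`, `pairs` and `result` objects are names made up for this sketch, and it assumes every name has exactly two parts separated by a single underscore.

```r
# Same example data as in the question
df <- data.frame(
  name = c("AAAA_AAAB", "AAAC_AAAD", "AAAD_AAAE",
           "AAAB_AAAA", "AAAD_AAAC", "AAAE_AAAD",
           "AAAB_AAAA"),
  stringsAsFactors = FALSE
)
df$name_ID <- seq_len(nrow(df))

# Order-independent key (hypothetical helper column): sort the two halves,
# so "AAAB_AAAA" and "AAAA_AAAB" end up with the same key
parts <- strsplit(df$name, "_", fixed = TRUE)
df$key <- vapply(parts, function(p) paste(sort(p), collapse = "_"), character(1))

# Self-join on the key: every row is paired with every row sharing its key
left  <- df[, c("key", "name_ID")]
right <- data.frame(key = df$key, Confronted_Traffic = df$name_ID,
                    stringsAsFactors = FALSE)
pairs <- merge(left, right, by = "key")

# Drop each row's match with itself and tidy up the ordering
pairs  <- pairs[pairs$name_ID != pairs$Confronted_Traffic, ]
result <- pairs[order(pairs$name_ID, pairs$Confronted_Traffic),
                c("name_ID", "Confronted_Traffic")]
rownames(result) <- NULL
result
```

On the example data this gives the same ten name_ID / Confronted_Traffic pairs as the pipeline above. Like that pipeline, it also pairs rows whose names are exactly identical (such as name_ID 4 and 7), not only reversed ones.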

Note: Please use proper code formatting, here is how to do it
