How to remove common strings between 2 variables

Wkp · March 23, 2022, 2:13pm

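I need to create a third variable that removes the strings common to the first two variables, keeping only the words that appear in one of them (see the Result column below for the expected output).
Can you please help with an easy function?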

a<- data.frame(V1= c("carlos rodrigo", "sarah", "patricia  raquel", "leonardo"), V2= c("rodrigo", "patri", "raquel", "oscar leonardo"), 
               Result = c( "carlos", "patri sarah", "patricia", "oscar"))

Created on 2022-03-23 by the reprex package (v2.0.1)

nirgrahamuk · March 23, 2022, 2:50pm

There are probably many ways to do it. Here is one.

a<- data.frame(V1= c("carlos rodrigo", "sarah", "patricia  raquel", "leonardo"), V2= c("rodrigo", "patri", "raquel", "oscar leonardo"), 
               Result = c( "carlos", "patri sarah", "patricia", "oscar"))
library(tidyverse)
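a %>% rowwise() %>% 
  mutate(# words that appear in both V1 and V2
         common_content = list(intersect(x=str_split(V1," ",simplify = TRUE),
                               y=str_split(V2," ",simplify = TRUE))),
         # words that appear only in V1, and only in V2
         V1_unique = list(setdiff(str_split(V1," ",simplify = TRUE),common_content)),
         V2_unique = list(setdiff(str_split(V2," ",simplify = TRUE),common_content)),
         # combine the leftover words first, then join them with single spaces,
         # so rows with several leftover words on both sides are not recycled pairwise
         result_by_code = trimws(paste(c(V2_unique,V1_unique),collapse=" ")))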

StatSteph · March 23, 2022, 3:42pm

And here's another way:

library(tidyverse)

a<- data.frame(V1= c("carlos rodrigo", "sarah", "patricia  raquel", "leonardo"), V2= c("rodrigo", "patri", "raquel", "oscar leonardo"), 
               Result = c( "carlos", "patri sarah", "patricia", "oscar"))

symdiff <- function( x, y) { setdiff( union(x, y), intersect(x, y))}

a %>%
  rowwise() %>%
  mutate(
    V1sp=str_split(V1, "\\s+"),
    V2sp=str_split(V2, "\\s+"),
    Result2=str_c(symdiff(V1sp, V2sp), collapse = " ")
  )
#> # A tibble: 4 × 6
#> # Rowwise: 
#>   V1               V2             Result      V1sp      V2sp      Result2    
#>   <chr>            <chr>          <chr>       <list>    <list>    <chr>      
#> 1 carlos rodrigo   rodrigo        carlos      <chr [2]> <chr [1]> carlos     
#> 2 sarah            patri          patri sarah <chr [1]> <chr [1]> sarah patri
#> 3 patricia  raquel raquel         patricia    <chr [2]> <chr [1]> patricia   
#> 4 leonardo         oscar leonardo oscar       <chr [1]> <chr [2]> oscar

Created on 2022-03-23 by the reprex package (v2.0.1)
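
To see what the symdiff() helper in the post above does on its own, here is a minimal base-R sketch using the words from rows 4 and 2 of the example data. The variable names row4 and row2 are only for illustration.

```r
# Symmetric difference: elements that are in exactly one of the two sets.
symdiff <- function(x, y) setdiff(union(x, y), intersect(x, y))

# Row 4 of the example: "leonardo" is shared, so only "oscar" remains.
row4 <- symdiff(c("leonardo"), c("oscar", "leonardo"))
print(row4)

# Row 2 of the example: nothing is shared, so both words are kept,
# in the order union() produces them (elements of x first, then y).
row2 <- symdiff(c("sarah"), c("patri"))
print(row2)
```

Because union() lists the elements of x first, row 2 comes out as "sarah patri" rather than the "patri sarah" in the Result column; swapping the arguments would reverse that order.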

Wkp · March 30, 2022, 9:38am

Thank you very much to everyone who sent a solution. Do you know where I can start learning about UDFs (user-defined functions) and how to apply them? Many thanks, magnificent solution.

I also think it would be possible to do it the way below:

library(tidyverse)

a<- data.frame(V1= c("carlos rodrigo", "sarah", "patricia  raquel", "leonardo"), V2= c("rodrigo", "patri", "raquel", "oscar leonardo"), 
               Result = c( "carlos", "patri sarah", "patricia", "oscar"))

symdiff <- function( x, y) { setdiff( union(x, y), intersect(x, y))}

a %>%
  rowwise() %>%
  mutate(
    V1sp=strsplit((tolower(V1), " "), 
    V2sp=strsplit((tolower(V2), " "),
    Result2=str_c(symdiff(V1sp, V2sp), collapse = " ")
  )

nirgrahamuk · March 30, 2022, 1:30pm

The chapter on functions in R4DS is here: 19 Functions | R for Data Science (had.co.nz)
I also remember that the interactive R lessons in the 'swirl' package had good coverage of functions:
swirl | Students (swirlstats.com)
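
As a small illustration of writing a user-defined function and applying it, here is a sketch in base R that wraps the logic from this thread. The name remove_common is made up for this example, and it assumes that words are separated by whitespace.

```r
# remove_common() is a hypothetical helper name, not from any package.
# It takes two strings and returns the words that appear in only one of them.
remove_common <- function(s1, s2) {
  w1 <- strsplit(trimws(s1), "\\s+")[[1]]
  w2 <- strsplit(trimws(s2), "\\s+")[[1]]
  only_one <- setdiff(union(w1, w2), intersect(w1, w2))
  paste(only_one, collapse = " ")
}

a <- data.frame(
  V1 = c("carlos rodrigo", "sarah", "patricia  raquel", "leonardo"),
  V2 = c("rodrigo", "patri", "raquel", "oscar leonardo"),
  stringsAsFactors = FALSE
)

# mapply() calls the function once per row, pairing V1[i] with V2[i].
a$Result2 <- mapply(remove_common, a$V1, a$V2, USE.NAMES = FALSE)
print(a)
```

The function itself works on one pair of strings; mapply() is what applies it down the two columns, which is one common pattern for using a UDF on a data frame.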

StatSteph · March 30, 2022, 4:00pm

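You have some extra parentheses in the strsplit() calls. There is also one difference between splitting on " " and on "\\s+": look at row 3 of the results. "patricia" has a trailing space, because "patricia  raquel" has two spaces in the middle and splitting on a single space leaves an empty string between them.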

library(tidyverse)

a<- data.frame(V1= c("carlos rodrigo", "sarah", "patricia  raquel", "leonardo"), V2= c("rodrigo", "patri", "raquel", "oscar leonardo"), 
               Result = c( "carlos", "patri sarah", "patricia", "oscar"))

symdiff <- function( x, y) { setdiff( union(x, y), intersect(x, y))}

a %>%
  rowwise() %>%
  mutate(
    V1sp=strsplit(tolower(V1), " "), 
    V2sp=strsplit(tolower(V2), " "),
    Result2=str_c(symdiff(V1sp, V2sp), collapse = " ")
  )
#> # A tibble: 4 x 6
#> # Rowwise: 
#>   V1               V2             Result      V1sp      V2sp      Result2      
#>   <chr>            <chr>          <chr>       <list>    <list>    <chr>        
#> 1 carlos rodrigo   rodrigo        carlos      <chr [2]> <chr [1]> "carlos"     
#> 2 sarah            patri          patri sarah <chr [1]> <chr [1]> "sarah patri"
#> 3 patricia  raquel raquel         patricia    <chr [3]> <chr [1]> "patricia "  
#> 4 leonardo         oscar leonardo oscar       <chr [1]> <chr [2]> "oscar"

Created on 2022-03-30 by the reprex package (v2.0.1)
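
The difference between the two split patterns can be checked in isolation with base R. This is only a sketch on the row 3 value; the variable names are illustrative.

```r
x <- "patricia  raquel"  # two spaces between the words

# Splitting on a single space leaves an empty string between the two spaces.
by_space <- strsplit(x, " ")[[1]]
print(by_space)

# Splitting on one or more whitespace characters does not.
by_ws <- strsplit(x, "\\s+")[[1]]
print(by_ws)

# After dropping the shared word "raquel", the empty string is still there,
# so joining with " " produces the trailing space seen in row 3.
leftover <- paste(setdiff(by_space, "raquel"), collapse = " ")
print(nchar(leftover))
```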
