Find matching elements row by row, fast

veda · April 10, 2020, 3:53am

Hi R experts,

I want to extract, row by row, the characters that column x and column y have in common. Here is my code:
data <- data.frame(x = c('xdcff', 'dfghj', 'erbmp'), y = c('aaaa', 'dvbgg', 'tg'))
data$x <- as.character(data$x)
data$y <- as.character(data$y)
data$m <- 0
for (i in seq_len(nrow(data))) {
  if (nchar(data$x[i]) > 1) {
    # Split both strings into single characters and keep the shared ones
    data$m[i] <- paste(intersect(strsplit(data$x[i], split = '')[[1]],
                                 strsplit(data$y[i], split = '')[[1]]),
                       collapse = '')
  }
}

data$m is the result I want. The strings to compare may also contain Chinese characters, so they need to be split into individual characters.
The problem is that I have 500 thousand rows, and the loop takes forever to run. I would appreciate it if you could share a faster way to do this.
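
To illustrate the point about Chinese characters, here is a minimal sketch of the split-and-intersect step on two made-up strings (they are not from the real data). It assumes the strings are valid UTF-8; they are written as \u escapes so the example does not depend on the file encoding.

```r
# Hypothetical example strings written as Unicode escapes:
# x is "数据分析" and y is "分析数学".
x <- "\u6570\u636e\u5206\u6790"
y <- "\u5206\u6790\u6570\u5b66"

# split = "" breaks a string into individual characters, not bytes
x_chars <- strsplit(x, split = "")[[1]]
y_chars <- strsplit(y, split = "")[[1]]

# intersect() keeps each shared character once, in the order it appears in x
shared <- paste(intersect(x_chars, y_chars), collapse = "")
```

Here shared holds the three characters that occur in both strings, in the order they appear in x.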

Best,
Veda

Rafael.F · April 10, 2020, 5:05am

Hi Veda,
Use the apply family instead.

First, create a function that does the same thing for a single pair of strings:

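this_function <- function(x, y) {
  # When x has at most one character, return '' instead of NULL so that
  # mapply() can simplify the results to a character vector
  if (nchar(x) <= 1) return('')
  # Split both strings into single characters and keep the ones they share
  paste(intersect(strsplit(x, split = '')[[1]],
                  strsplit(y, split = '')[[1]]),
        collapse = '')
}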

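Then apply this_function to the two vectors data$x and data$y:

# USE.NAMES = FALSE keeps mapply() from naming the result after the values of data$x
data$m <- mapply(this_function, data$x, data$y, USE.NAMES = FALSE)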

Please let me know if it reduces your run time.
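
A variant of the same idea, as a sketch rather than a tuned solution: split each column once up front, then intersect the pieces pair by pair with vapply(), which checks that every row yields exactly one string. The name common_chars is hypothetical, and this version drops the nchar() check, so it compares every row. The system.time() call is only there so you can measure the run time on your own data.

```r
# common_chars() is a hypothetical helper name, not part of base R.
common_chars <- function(x, y) {
  # Split every string into single characters once, up front
  xs <- strsplit(x, split = "")
  ys <- strsplit(y, split = "")
  # vapply() enforces that each row yields exactly one character string
  vapply(seq_along(xs), function(i) {
    paste(intersect(xs[[i]], ys[[i]]), collapse = "")
  }, character(1))
}

data <- data.frame(x = c("xdcff", "dfghj", "erbmp"),
                   y = c("aaaa", "dvbgg", "tg"),
                   stringsAsFactors = FALSE)
data$m <- common_chars(data$x, data$y)

# system.time() reports how long the call takes on the data you pass in
elapsed <- system.time(common_chars(data$x, data$y))
```

Part of the speed-up over the original loop probably comes from assigning the whole column at once instead of writing data$m[i] inside the loop, since repeated element-wise assignment into a data frame tends to be slow.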

veda · April 10, 2020, 6:38am

It worked super fast. Thanks Rafael.

andresrcs · April 10, 2020, 1:31pm

This would be a tidyverse-based solution:

library(tidyverse)

data <-  data.frame(stringsAsFactors = FALSE,
                  x = c('xdcff','dfghj','erbmp'),
                  y = c('aaaa','dvbgg','tg'),
                  m = 0)

data %>%
    rowwise() %>% 
    mutate(m = paste(intersect(str_split(x, pattern = "", simplify = TRUE),
                               str_split(y, pattern = "", simplify = TRUE)),
                     collapse = "")
           ) %>% 
    ungroup()
#> # A tibble: 3 x 3
#>   x     y     m    
#>   <chr> <chr> <chr>
#> 1 xdcff aaaa  ""   
#> 2 dfghj dvbgg "dg" 
#> 3 erbmp tg    ""

Created on 2020-04-10 by the reprex package (v0.3.0.9001)
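
rowwise() evaluates the expression once per row, which may become slow on several hundred thousand rows (this was not measured here). A possible alternative, assuming the dplyr, purrr and stringr packages are installed, is to split both columns once and walk over the pairs with map2_chr(). This is a sketch, not a benchmarked replacement.

```r
library(dplyr)
library(purrr)
library(stringr)

data <- data.frame(stringsAsFactors = FALSE,
                   x = c("xdcff", "dfghj", "erbmp"),
                   y = c("aaaa", "dvbgg", "tg"))

result <- data %>%
  mutate(m = map2_chr(
    # str_split() with an empty pattern returns one vector of characters per row
    str_split(x, pattern = ""),
    str_split(y, pattern = ""),
    # .x and .y are the character vectors for the current row
    ~ paste(intersect(.x, .y), collapse = "")
  ))
```

map2_chr() also guarantees that m is a character column, since it raises an error if any row does not produce a single string.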
