Merge two dataframes based on partially matched string (but not pre-determined string)

amjass12 · July 15, 2021, 11:40am

Hi,

I have two dataframes that I need to merge. Each dataframe contains a column or row where the first 30 or so characters are similar but the characters after are different. However, the first 30 characters for each row are different throughout, they are sample names for some data so I don't have a preset partial string I can match with.

reproducible df example:

df1<-data.frame(1:4)
rownames(df1)<-c('hello5678', 'how6543', 'are9863', 'you5364')

df2<-data.frame(1:4)
rownames(df2)<-c('hello1234', 'how7643', 'are6323', 'you7496')

as you can see the first set of characters or words are common but the following numbers are different - I have 1000's of samples, so it is not feasible to make them the same one by one. I have looked everywhere but cannot find a way of merging these dataframes together based on rownames (in this reproducible example) or if they were columns.. i am trying to achieve the same thing.

Any help is appreciated. thank you.

pieterjanvc · July 15, 2021, 11:53am

Hi,

This would be possible, providing you can specify how to extract the overlapping first part of each string. In your example, the overlapping parts are all starting with the same word (which does not contain any digits), then followed by different digits. This means we can extract the parts easily using RegEx like so

library(tidyverse)

#Data
df1<-data.frame(x = 1:4)
rownames(df1)<-c('hello5678', 'how6543', 'are9863', 'you5364')
df2<-data.frame(y = 1:4)
rownames(df2)<-c('hello1234', 'how7643', 'are6323', 'you7496')

#Extract the common naming parts from row names
df1 = df1 %>% rownames_to_column() %>% 
  mutate(rowname = str_extract(rowname, "^\\D+"))

df2 = df2 %>% rownames_to_column() %>% 
  mutate(rowname = str_extract(rowname, "^\\D+"))

#Join the tables
df1 %>% full_join(df2, by = "rowname")
#>   rowname x y
#> 1   hello 1 1
#> 2     how 2 2
#> 3     are 3 3
#> 4     you 4 4

^{Created on 2021-07-15 by the reprex package (v2.0.0)}

The RegEx pattern ^\D+ means: extract from the start ^ everything that is not a digit \D+

amjass12 · July 15, 2021, 12:23pm

Hi,

thank you very much for this - is there a way (i know I need to check the help page and will be doing so now) but instead of everything that is not a digit, say, everything that is the first 30 characters regardless of whether it is a digit or not. Unfortunately these rownames are more like hello245mynameis67899 and it might be the last set of digits that differ...

thank you

pieterjanvc · July 15, 2021, 5:30pm

Hi,

Yea with RegEx you can come up with any pattern you like, as long as it is consistent between the different row names.

The regex for capturing the first 30 characters:

^.{1,30} (can be anything, also spaces or special characters)
^\w{1,30} (only letters, numbers and underscore)
...

To learn more about RegEx, check out this website. Remember in R any \ in a string needs to be escaped so that turns into \\. So the second example RegEx ^\w{1,30} becomes "^\\w{1,30}" in R

Hope this helps,
PJ

amjass12 · July 16, 2021, 1:02pm

wonderful

-thank you!

system · July 23, 2021, 1:02pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.