How to merge two dataframes in R with more than one gene identifier?

mtoufiq · May 6, 2021, 10:03am

Hi,

I have two questions about the dataframe operations in R:

1. Merging two data frames in R based on the gene symbol identifier. How does merging work when a column entry contains more than one gene symbol, separated by /// or ..... ? Which gene symbol gets picked by the merge? For instance, DDR1 /// MIR4640, or in another format, DDR1 ..... MIR4640.
2. Removing the rows with multiple gene symbols. An example of the data is provided below:

dput(head(GSE26378_data.matrix.nor))

structure(c(0.6756, 2.0564, 0.990533333333333, 1.252, 0.996266666666667, 
0.7107, 0.9596, 1.7188, 0.925, 1.2134, 1.06933333333333, 1.75393333333333, 
0.936866666666667, 1.52126666666667, 0.341266666666667, 1.21433333333333, 
1.09193333333333, 0.822666666666667, 1.40113333333333, 1.26673333333333, 
1.31053333333333, 1.10506666666667, 1.06953333333333, 0.541766666666667, 
0.815333333333333, 0.831466666666667, 0.4028, 1.25846666666667, 
0.922066666666667, 0.6774, 0.964866666666667, 0.7709, 0.404933333333333, 
1.19093333333333, 0.9622, 0.655766666666667, 0.772466666666667, 
1.1494, 0.246266666666667, 1.2376, 1.10593333333333, 0.627466666666667, 
0.902933333333333, 0.8008, 1.017, 1.21596666666667, 1.10786666666667, 
0.6652, 0.8872, 0.783, 0.749933333333333, 1.27313333333333, 0.908533333333333, 
0.6245, 0.659733333333333, 0.946666666666667, 1.77033333333333, 
0.923133333333333, 1.01566666666667, 1.21173333333333, 0.6548, 
1.39953333333333, 1.08426666666667, 0.827766666666667, 1.14753333333333, 
0.851066666666667, 0.482133333333333, 0.978033333333333, 1.9255, 
0.6996, 1.12153333333333, 1.00046666666667, 1.19473333333333, 
0.576266666666667, 0.282266666666667, 1.31666666666667, 1.00833333333333, 
1.02413333333333, 0.780933333333333, 1.15693333333333, 0.272533333333333, 
1.17853333333333, 1.05426666666667, 0.518533333333333, 1.01946666666667, 
0.8346, 0.5708, 1.49526666666667, 1.07893333333333, 0.673), .Dim = c(6L, 
15L), .Dimnames = list(c("DDR1 /// MIR4640", "RFC2", "HSPA6", 
"PAX8", "GUCA1A", "MIR5193 /// UBA7"), c("GSM647547", "GSM647552", 
"GSM647553", "GSM647565", "GSM647569", "GSM647574", "GSM647577", 
"GSM647580", "GSM647560", "GSM647619", "GSM647550", "GSM647528", 
"GSM647537", "GSM647616", "GSM647626")))

Thank you,

Toufiq

pieterjanvc · May 6, 2021, 12:46pm

Hi,

Welcome to the RStudio community!

Here is an example of how you can clean your data and join it with new data:

library(tidyverse)

myData = structure(c(0.6756, 2.0564, 0.990533333333333, 1.252, 0.996266666666667, 
            0.7107, 0.9596, 1.7188, 0.925, 1.2134, 1.06933333333333, 1.75393333333333, 
            0.936866666666667, 1.52126666666667, 0.341266666666667, 1.21433333333333, 
            1.09193333333333, 0.822666666666667, 1.40113333333333, 1.26673333333333, 
            1.31053333333333, 1.10506666666667, 1.06953333333333, 0.541766666666667, 
            0.815333333333333, 0.831466666666667, 0.4028, 1.25846666666667, 
            0.922066666666667, 0.6774, 0.964866666666667, 0.7709, 0.404933333333333, 
            1.19093333333333, 0.9622, 0.655766666666667, 0.772466666666667, 
            1.1494, 0.246266666666667, 1.2376, 1.10593333333333, 0.627466666666667, 
            0.902933333333333, 0.8008, 1.017, 1.21596666666667, 1.10786666666667, 
            0.6652, 0.8872, 0.783, 0.749933333333333, 1.27313333333333, 0.908533333333333, 
            0.6245, 0.659733333333333, 0.946666666666667, 1.77033333333333, 
            0.923133333333333, 1.01566666666667, 1.21173333333333, 0.6548, 
            1.39953333333333, 1.08426666666667, 0.827766666666667, 1.14753333333333, 
            0.851066666666667, 0.482133333333333, 0.978033333333333, 1.9255, 
            0.6996, 1.12153333333333, 1.00046666666667, 1.19473333333333, 
            0.576266666666667, 0.282266666666667, 1.31666666666667, 1.00833333333333, 
            1.02413333333333, 0.780933333333333, 1.15693333333333, 0.272533333333333, 
            1.17853333333333, 1.05426666666667, 0.518533333333333, 1.01946666666667, 
            0.8346, 0.5708, 1.49526666666667, 1.07893333333333, 0.673), .Dim = c(6L, 
                                                                                 15L), 
            .Dimnames = list(c("DDR1 /// MIR4640", "RFC2", "HSPA6",
                               "PAX8", "GUCA1A", "MIR5193 /// UBA7"), 
                             c("GSM647547", "GSM647552", "GSM647553", "GSM647565", "GSM647569", 
                               "GSM647574", "GSM647577", "GSM647580", "GSM647560", "GSM647619", 
                               "GSM647550", "GSM647528", "GSM647537", "GSM647616", "GSM647626")))

#Transform into data frame and clean genes
myData = myData  %>% as.data.frame() %>% rownames_to_column("gene") %>% 
  rowwise() %>%
  mutate(gene = str_split(gene, " /// ", simplify = TRUE) %>% 
           sort() %>% paste(collapse = "_"))
myData[,1:3]
#> # A tibble: 6 x 3
#> # Rowwise: 
#>   gene         GSM647547 GSM647552
#>   <chr>            <dbl>     <dbl>
#> 1 DDR1_MIR4640     0.676     0.960
#> 2 RFC2             2.06      1.72 
#> 3 HSPA6            0.991     0.925
#> 4 PAX8             1.25      1.21 
#> 5 GUCA1A           0.996     1.07 
#> 6 MIR5193_UBA7     0.711     1.75

#new data
newData = data.frame(
  gene = c("PAX8", "MIR5193 ... UBA7", "HSPA6", "CFTR", "MIR4640 ... DDR1"),
  newVal1 = runif(5),
  newVal2 = runif(5)
)
newData
#>               gene   newVal1   newVal2
#> 1             PAX8 0.3042653 0.7143048
#> 2 MIR5193 ... UBA7 0.3860882 0.7852422
#> 3            HSPA6 0.6428005 0.7612599
#> 4             CFTR 0.2792017 0.5716758
#> 5 MIR4640 ... DDR1 0.9572771 0.2558074

#Transform to match the existing one
#fixed() matches the dots literally instead of as regex wildcards
newData = newData %>% rowwise() %>% 
  mutate(gene = str_split(gene, fixed(" ... "), simplify = TRUE) %>% 
           sort() %>% paste(collapse = "_"))
newData
#> # A tibble: 5 x 3
#> # Rowwise: 
#>   gene         newVal1 newVal2
#>   <chr>          <dbl>   <dbl>
#> 1 PAX8           0.304   0.714
#> 2 MIR5193_UBA7   0.386   0.785
#> 3 HSPA6          0.643   0.761
#> 4 CFTR           0.279   0.572
#> 5 DDR1_MIR4640   0.957   0.256

#Join
result = myData %>% full_join(newData, by = "gene")

#Result
result[c(1:2, 15:18)]
#> # A tibble: 7 x 6
#> # Rowwise: 
#>   gene         GSM647547 GSM647616 GSM647626 newVal1 newVal2
#>   <chr>            <dbl>     <dbl>     <dbl>   <dbl>   <dbl>
#> 1 DDR1_MIR4640     0.676     0.781     1.02    0.957   0.256
#> 2 RFC2             2.06      1.16      0.835  NA      NA    
#> 3 HSPA6            0.991     0.273     0.571   0.643   0.761
#> 4 PAX8             1.25      1.18      1.50    0.304   0.714
#> 5 GUCA1A           0.996     1.05      1.08   NA      NA    
#> 6 MIR5193_UBA7     0.711     0.519     0.673   0.386   0.785
#> 7 CFTR            NA        NA        NA       0.279   0.572

Created on 2021-05-06 by the reprex package (v2.0.0)

Once the "gene" column is formatted the same way in every dataset, you can simply join by it. I did this by splitting any entry that contains multiple genes at its divider, sorting the pieces, and pasting them back together with "_". This turns every multi-gene entry into one consistent key, regardless of the original divider or gene order.
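To address the original question of which gene symbol gets picked: none is. A merge compares the whole string, so "DDR1 /// MIR4640" only matches an identical "DDR1 /// MIR4640". The base R sketch below illustrates this with made-up values; `normalize_key()` is a hypothetical helper written for this example, not a function from any package.

```r
# Toy tables; the values are invented for illustration.
expr <- data.frame(
  gene = c("DDR1 /// MIR4640", "RFC2"),
  s1 = c(0.68, 2.06),
  stringsAsFactors = FALSE
)
annot <- data.frame(
  gene = c("DDR1", "MIR4640 ... DDR1", "RFC2"),
  score = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# merge() compares whole strings, so only "RFC2" finds a partner.
raw <- merge(expr, annot, by = "gene")
raw$gene

# Hypothetical helper: split on either divider, sort, re-join with "_".
normalize_key <- function(x) {
  parts <- strsplit(x, " /// | \\.\\.\\. ")
  vapply(parts, function(p) paste(sort(p), collapse = "_"), character(1))
}

expr$gene <- normalize_key(expr$gene)
annot$gene <- normalize_key(annot$gene)

# After normalizing, the two multi-gene entries share one key and match.
fixed <- merge(expr, annot, by = "gene")
fixed$gene
```

Note that the single-symbol "DDR1" row in `annot` still does not match the combined key; whether it should is a design decision that depends on how you want probes with several symbols to be treated.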

Hope this helps,
PJ

mtoufiq · May 6, 2021, 1:32pm

@pieterjanvc, thank you so much for the suggestions; they are indeed helpful. However, since the symbols belong to the same feature/probe, it would make more sense to give each symbol its own row with the same data points rather than joining them with "_". Is there a way to achieve this?

dput(myData_v1)

structure(list(gene = c("DDR1", "MIR4640", "RFC2", "HSPA6", "PAX8", 
"GUCA1A", "MIR5193", "UBA7"), GSM647547 = c(0.6756, 0.6756, 2.0564, 
0.990533333, 1.252, 0.996266667, 0.7107, 0.7107), GSM647552 = c(0.9596, 
0.9596, 1.7188, 0.925, 1.2134, 1.069333333, 1.753933333, 1.753933333
), GSM647553 = c(0.936866667, 0.936866667, 1.521266667, 0.341266667, 
1.214333333, 1.091933333, 0.822666667, 0.822666667), GSM647565 = c(1.401133333, 
1.401133333, 1.266733333, 1.310533333, 1.105066667, 1.069533333, 
0.541766667, 0.541766667), GSM647569 = c(0.815333333, 0.815333333, 
0.831466667, 0.4028, 1.258466667, 0.922066667, 0.6774, 0.6774
), GSM647574 = c(0.964866667, 0.964866667, 0.7709, 0.404933333, 
1.190933333, 0.9622, 0.655766667, 0.655766667), GSM647577 = c(0.772466667, 
0.772466667, 1.1494, 0.246266667, 1.2376, 1.105933333, 0.627466667, 
0.627466667), GSM647580 = c(0.902933333, 0.902933333, 0.8008, 
1.017, 1.215966667, 1.107866667, 0.6652, 0.6652), GSM647560 = c(0.8872, 
0.8872, 0.783, 0.749933333, 1.273133333, 0.908533333, 0.6245, 
0.6245), GSM647619 = c(0.659733333, 0.659733333, 0.946666667, 
1.770333333, 0.923133333, 1.015666667, 1.211733333, 1.211733333
), GSM647550 = c(0.6548, 0.6548, 1.399533333, 1.084266667, 0.827766667, 
1.147533333, 0.851066667, 0.851066667), GSM647528 = c(0.482133333, 
0.482133333, 0.978033333, 1.9255, 0.6996, 1.121533333, 1.000466667, 
1.000466667), GSM647537 = c(1.194733333, 1.194733333, 0.576266667, 
0.282266667, 1.316666667, 1.008333333, 1.024133333, 1.024133333
), GSM647616 = c(0.780933333, 0.780933333, 1.156933333, 0.272533333, 
1.178533333, 1.054266667, 0.518533333, 0.518533333), GSM647626 = c(1.019466667, 
1.019466667, 0.8346, 0.5708, 1.495266667, 1.078933333, 0.673, 
0.673)), class = "data.frame", row.names = c(NA, -8L))

pieterjanvc · May 6, 2021, 8:44pm

Hi,

Sure, that's even easier with the separate_rows() function:

library(tidyverse)

myData = structure(c(0.6756, 2.0564, 0.990533333333333, 1.252, 0.996266666666667, 
            0.7107, 0.9596, 1.7188, 0.925, 1.2134, 1.06933333333333, 1.75393333333333, 
            0.936866666666667, 1.52126666666667, 0.341266666666667, 1.21433333333333, 
            1.09193333333333, 0.822666666666667, 1.40113333333333, 1.26673333333333, 
            1.31053333333333, 1.10506666666667, 1.06953333333333, 0.541766666666667, 
            0.815333333333333, 0.831466666666667, 0.4028, 1.25846666666667, 
            0.922066666666667, 0.6774, 0.964866666666667, 0.7709, 0.404933333333333, 
            1.19093333333333, 0.9622, 0.655766666666667, 0.772466666666667, 
            1.1494, 0.246266666666667, 1.2376, 1.10593333333333, 0.627466666666667, 
            0.902933333333333, 0.8008, 1.017, 1.21596666666667, 1.10786666666667, 
            0.6652, 0.8872, 0.783, 0.749933333333333, 1.27313333333333, 0.908533333333333, 
            0.6245, 0.659733333333333, 0.946666666666667, 1.77033333333333, 
            0.923133333333333, 1.01566666666667, 1.21173333333333, 0.6548, 
            1.39953333333333, 1.08426666666667, 0.827766666666667, 1.14753333333333, 
            0.851066666666667, 0.482133333333333, 0.978033333333333, 1.9255, 
            0.6996, 1.12153333333333, 1.00046666666667, 1.19473333333333, 
            0.576266666666667, 0.282266666666667, 1.31666666666667, 1.00833333333333, 
            1.02413333333333, 0.780933333333333, 1.15693333333333, 0.272533333333333, 
            1.17853333333333, 1.05426666666667, 0.518533333333333, 1.01946666666667, 
            0.8346, 0.5708, 1.49526666666667, 1.07893333333333, 0.673), .Dim = c(6L, 
                                                                                 15L), 
            .Dimnames = list(c("DDR1 /// MIR4640", "RFC2", "HSPA6",
                               "PAX8", "GUCA1A", "MIR5193 /// UBA7"), 
                             c("GSM647547", "GSM647552", "GSM647553", "GSM647565", "GSM647569", 
                               "GSM647574", "GSM647577", "GSM647580", "GSM647560", "GSM647619", 
                               "GSM647550", "GSM647528", "GSM647537", "GSM647616", "GSM647626")))

#Transform into data frame
myData = myData  %>% as.data.frame() %>% rownames_to_column("gene") %>% 
  separate_rows(gene, sep = " /// ")
myData[,1:3]
#> # A tibble: 8 x 3
#>   gene    GSM647547 GSM647552
#>   <chr>       <dbl>     <dbl>
#> 1 DDR1        0.676     0.960
#> 2 MIR4640     0.676     0.960
#> 3 RFC2        2.06      1.72 
#> 4 HSPA6       0.991     0.925
#> 5 PAX8        1.25      1.21 
#> 6 GUCA1A      0.996     1.07 
#> 7 MIR5193     0.711     1.75 
#> 8 UBA7        0.711     1.75

#new data
newData = data.frame(
  gene = c("PAX8", "MIR5193 ... UBA7", "HSPA6", "CFTR", "MIR4640 ... DDR1"),
  newVal1 = runif(5),
  newVal2 = runif(5)
)
newData
#>               gene   newVal1   newVal2
#> 1             PAX8 0.9862127 0.2955867
#> 2 MIR5193 ... UBA7 0.8535393 0.4826889
#> 3            HSPA6 0.1682033 0.7512090
#> 4             CFTR 0.7692927 0.7765908
#> 5 MIR4640 ... DDR1 0.2215456 0.2596525

#Transform to match the existing one
#sep is a regex, so the dots are escaped; rowwise() is not needed here
newData = newData %>% 
  separate_rows(gene, sep = " \\.\\.\\. ")
newData
#> # A tibble: 7 x 3
#>   gene    newVal1 newVal2
#>   <chr>     <dbl>   <dbl>
#> 1 PAX8      0.986   0.296
#> 2 MIR5193   0.854   0.483
#> 3 UBA7      0.854   0.483
#> 4 HSPA6     0.168   0.751
#> 5 CFTR      0.769   0.777
#> 6 MIR4640   0.222   0.260
#> 7 DDR1      0.222   0.260

#Join
result = myData %>% full_join(newData, by = "gene")

#Result
result[c(1:2, 15:18)]
#> # A tibble: 9 x 6
#>   gene    GSM647547 GSM647616 GSM647626 newVal1 newVal2
#>   <chr>       <dbl>     <dbl>     <dbl>   <dbl>   <dbl>
#> 1 DDR1        0.676     0.781     1.02    0.222   0.260
#> 2 MIR4640     0.676     0.781     1.02    0.222   0.260
#> 3 RFC2        2.06      1.16      0.835  NA      NA    
#> 4 HSPA6       0.991     0.273     0.571   0.168   0.751
#> 5 PAX8        1.25      1.18      1.50    0.986   0.296
#> 6 GUCA1A      0.996     1.05      1.08   NA      NA    
#> 7 MIR5193     0.711     0.519     0.673   0.854   0.483
#> 8 UBA7        0.711     0.519     0.673   0.854   0.483
#> 9 CFTR       NA        NA        NA       0.769   0.777

Created on 2021-05-06 by the reprex package (v2.0.0)
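If you would rather avoid the tidyverse dependency, the same row expansion can be sketched in base R. This is only an illustration with made-up values, not a replacement for the approach above: split each key, repeat every row once per symbol it contains, then overwrite the key column.

```r
# Toy data; the values are invented for illustration.
df <- data.frame(
  gene = c("DDR1 /// MIR4640", "RFC2"),
  s1 = c(0.68, 2.06),
  stringsAsFactors = FALSE
)

# Split each key on the literal divider (fixed = TRUE disables regex).
parts <- strsplit(df$gene, " /// ", fixed = TRUE)

# Repeat each row once per symbol, then replace the key with the single symbols.
expanded <- df[rep(seq_len(nrow(df)), lengths(parts)), , drop = FALSE]
expanded$gene <- unlist(parts)
rownames(expanded) <- NULL
expanded
```

Keep in mind that after splitting, the same symbol can appear in more than one row (for example when several probes map to one gene), and a join on such a key returns every matching combination.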

Hope this helps,
PJ

mtoufiq · May 7, 2021, 1:14am

@pieterjanvc, thank you so much. This is helpful and solves the problem.

I have another question: what if some rows have multiple gene symbols in an unusual form, separated by dots (for instance, GeneA..GeneB..GeneC)? How can I remove these from the data frame for further processing?

dput(Data)
structure(list(Sample_1 = c(0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6
), Sample_2 = c(1.15, 1.15, 1.15, 1.15, 0.6, 0.6, 0.6), Sample_3 = c(0.6, 
0.6, 0.6, 0.7, 0.7, 0.7, 0.7)), class = "data.frame", row.names = c("GeneE", 
"GeneF", "GeneK", "GeneM", "GeneA..GeneB..GeneC", "GeneXX", "GeneX..GeneY..GeneYY"
))

pieterjanvc · May 7, 2021, 11:59am

Do you mean removing the whole row, or just the extra genes in the name, e.g. GeneA..GeneB..GeneC --> GeneA?

mtoufiq · May 8, 2021, 2:14am

@pieterjanvc, it would be great if you could show me both ways; this will help me test and apply them appropriately:

1. Remove the whole row, and
2. Remove the extra genes in the name, e.g. GeneA..GeneB..GeneC --> GeneA

pieterjanvc · May 8, 2021, 1:02pm

Hi,

Here you go:

library(tidyverse)

myData = structure(
  list(
    Sample_1 = c(0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6),
    Sample_2 = c(1.15, 1.15, 1.15, 1.15, 0.6, 0.6, 0.6),
    Sample_3 = c(0.6, 0.6, 0.6, 0.7, 0.7, 0.7, 0.7)
  ),
  class = "data.frame",
  row.names = c(
    "GeneE",
    "GeneF",
    "GeneK",
    "GeneM",
    "GeneA..GeneB..GeneC",
    "GeneXX",
    "GeneX..GeneY..GeneYY"
  )
)


#Remove rows
myData %>% rownames_to_column("gene") %>% 
  filter(!str_detect(gene, "\\.\\."))
#>     gene Sample_1 Sample_2 Sample_3
#> 1  GeneE      0.6     1.15      0.6
#> 2  GeneF      0.6     1.15      0.6
#> 3  GeneK      0.6     1.15      0.6
#> 4  GeneM      0.6     1.15      0.7
#> 5 GeneXX      0.6     0.60      0.7

#Rename rows
myData %>% rownames_to_column("gene") %>% 
  mutate(
    gene = str_remove(gene, "\\.\\..*")
  )
#>     gene Sample_1 Sample_2 Sample_3
#> 1  GeneE      0.6     1.15      0.6
#> 2  GeneF      0.6     1.15      0.6
#> 3  GeneK      0.6     1.15      0.6
#> 4  GeneM      0.6     1.15      0.7
#> 5  GeneA      0.6     0.60      0.7
#> 6 GeneXX      0.6     0.60      0.7
#> 7  GeneX      0.6     0.60      0.7

Created on 2021-05-08 by the reprex package (v2.0.0)

Both solutions use regex.

The first one detects any string containing the divider .. and removes those rows, using the pattern \.\. (written \\.\\. in R, because the backslash itself must be escaped inside a string).
The second one truncates every string at the first divider. In regex this is \.\..* (again \\.\\..* in R), with \.\. being the divider and .* meaning everything that follows.
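To see why the escaping matters, here is a small base R sketch using three of the names from the example data. An unescaped dot matches any character, so the pattern .. matches every name that is at least two characters long.

```r
genes <- c("GeneE", "GeneA..GeneB..GeneC", "GeneXX")

# Unescaped, "." is a wildcard, so ".." matches every name here.
any_two <- grepl("..", genes)

# Escaped, the pattern only matches two literal dots.
has_divider <- grepl("\\.\\.", genes)

# fixed = TRUE is an alternative that skips regex interpretation entirely.
has_divider_fixed <- grepl("..", genes, fixed = TRUE)

# Drop everything from the first divider onwards.
first_gene <- sub("\\.\\..*", "", genes)
```

The stringr functions used above behave the same way: str_detect() and str_remove() treat the pattern as a regex unless it is wrapped in fixed().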

Hope this helps,
PJ

mtoufiq · May 8, 2021, 1:27pm

@pieterjanvc, excellent. Thank you so much.
