Merge Two Dataframes

chetanpatil · October 19, 2018, 4:24am

Hi All,

I have two dataframes with same number of columns (number of rows can differ). Both dataframe contains an unique identifier column. I want to merge these dataframe as such that unique identifier matched column are binded in one row together and if the unique identifier is not in any one of these then append at the end of that specific dataframe.

I tried many non-tidyverse solution and nothing worked out. Please help if this can be done using tidyverse.

Thanks.

jcblum · October 19, 2018, 4:37am

I’m afraid I’m having trouble visualizing your desired output here. It’s much easier to understand coding questions by looking at code, rather than reading a verbal description. I think you’ll be more likely to get helpers to take an interest in this question if you can make a small reproducible example demonstrating what you mean.

No need to include your actual data frames — small example data frames with a similar structure will probably be easier to understand. It would also be a big help if you can include an example of what the desired output should look like.

chetanpatil · October 19, 2018, 5:02am

Sorry about that @jcblum

Here is the example. Data 1 and Data 2 is what I have as dataframe and I want output as in third table shown below.

I tried this: final_data <- merge(data.frame(data_1, row.names=NULL), data.frame(data_2, row.names=NULL), by = 0, all = TRUE)[-1]

But this doesn't give me the output as desired in third table below. Please suggest any solution if you may.

Data 1:

A B C D
1 2 3 A
1 2 3 B
1 2 3 C
1 2 3 D
1 2 3 E

Data 2:

A1 B2 C2 D2
1 2 3 D
1 2 3 E
1 2 3 F
1 2 3 A
1 2 3 B
1 2 3 P

Output:

A B C D A1 B2 C2 D2
1 2 3 A 1 2 3 A
1 2 3 B 1 2 3 B
1 2 3 C 0 0 0 0
1 2 3 D 1 2 3 D
1 2 3 E 1 2 3 E
0 0 0 0 1 2 3 P
0 0 0 0 1 2 3 F

mishabalyasin · October 19, 2018, 8:10am

Take a look at dplyr::full_join or data.table::merge. You'll need to convert NA's afterwards to 0, but that shouldn't be too difficult. Also, you probably don't need 2 columns D and D2 since it's a key you merge with meaning that you'll only have one of them in the result.

chetanpatil · October 19, 2018, 2:06pm

@mishabalyasin: Thank you.

I need column D and D2 in order to verify few more things later on.

mishabalyasin · October 19, 2018, 2:56pm

You know your code better, of course, but keep in mind that both dplyr::full_join and data.table::merge won't return you a duplicated key column. Also, both of the columns contain identical (not similar) information, so once you've checked it with one column, you've checked it with second as well.

chetanpatil · October 19, 2018, 4:19pm

@mishabalyasin:

Yes, but the problem is I want to keep those values from D and D2 that don't match and append it at the end as shown below:

A B C D A1 B2 C2 D2
1 2 3 A 1 2 3 A
1 2 3 B 1 2 3 B
1 2 3 C 1 2 3 C
1 2 3 D 1 2 3 D
1 2 3 E 1 2 3 E
0 0 0 0 1 2 3 P
0 0 0 0 1 2 3 F

mishabalyasin · October 19, 2018, 4:49pm

From dplyr::full_join help (emphasis mine):

full_join()
return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

So it'll keep track of those things by nature of having NA's in some of the columns.

I would show you how it looks in code, but it's not possible right now to copy-paste your example into my session. Reprex link @jcblum shared above can help you create one.

chetanpatil · October 19, 2018, 6:20pm

@mishabalyasin:

Does following reproducible data helps?

data_1 <- read.table(text = "1 2 3 A
                             1 2 3 B
                             1 2 3 C
                             1 2 3 D
                             1 2 3 E")
colnames(data_1) <- c('A','B','C','D')

data_2 <- read.table(text = "
                     1 2 3 D
                     1 2 3 E
                     1 2 3 F
                     1 2 3 A
                     1 2 3 B
                     1 2 3 P")
colnames(data_2) <- c('A1','B1','C1','D1')

mishabalyasin · October 19, 2018, 7:33pm

Yes, this helps a lot. The result of full_join:

data_1 <- read.table(text = "1 2 3 A
                             1 2 3 B
                             1 2 3 C
                             1 2 3 D
                             1 2 3 E", stringsAsFactors = FALSE)
colnames(data_1) <- c('A','B','C','D')

data_2 <- read.table(text = "
                     1 2 3 D
                     1 2 3 E
                     1 2 3 F
                     1 2 3 A
                     1 2 3 B
                     1 2 3 P", stringsAsFactors = FALSE)
colnames(data_2) <- c('A1','B1','C1','D1')

dplyr::full_join(data_1, data_2, by = c("D" = "D1"))
#>    A  B  C D A1 B1 C1
#> 1  1  2  3 A  1  2  3
#> 2  1  2  3 B  1  2  3
#> 3  1  2  3 C NA NA NA
#> 4  1  2  3 D  1  2  3
#> 5  1  2  3 E  1  2  3
#> 6 NA NA NA F  1  2  3
#> 7 NA NA NA P  1  2  3

^{Created on 2018-10-19 by the reprex package (v0.2.1)}

As you can see, you now have NA's that you can convert to 0 and presence of NA's tells you that this level wasn't present in one or another dataframe.

chetanpatil · October 20, 2018, 7:29pm

Thank you so much @mishabalyasin