Combining .x and .y columns after left_join()

lawrencelmli · September 17, 2020, 1:48pm

I have a problem but it is difficult to include data as they are sensitive:I have 2 databases, db1 has 749 rows and 110 columns and db2 with 28 rows and 46 columns. All the entries from db2 match some of the entries in db1 .
All of the columns from db2 are also contained with db1 but with new values in db2 that I wish to keep to replace NA s in db1 . I can use left_join(db1, db2) but this ends up with columns that ends with .x and .y . Whilst I am open to the idea of sorting these columns individually, I wonder if there is a better way to sort this within R.I am sorry that I can't make an example to illustrate my problem and would be grateful for any help/solution.

nirgrahamuk · September 17, 2020, 2:07pm

My recommendation is that you read the documentation for the tribble() function (tibble package) , and then put in the effort to manually make a small artificial example that captures the essence of your scenario.

kaijabean · September 17, 2020, 6:03pm

Hi @lawrencelmli,
I agree with @nirgrahamuk that it would be great if you could provide a small fake dataset so we can play around with your problem. But just from your description, I think I've had the same issue before.

What you want to do is essentially to "patch" values in one data frame with values from another, when and only when they're missing. This is a pretty common need that I've also run into semi-frequently in my work, and as far as I'm aware there currently isn't great support for it in the tidyverse. It's been discussed in the past, e.g. here, but I don't think it really exists yet.

What I usually do, in general terms, is:

Rename one of the columns before joining, so you don't end up with .x and .y: e.g. if you have var in each data frame, rename one of them to var1 before joining.
Perform the join
Use case_when() or a similar function to set the value of var in the joined data frame to the value of var1 if is.na(var)
Use select() to remove the var1 column

I know that's a little more manual than you were probably looking for, but it's the best solution I have for this. I'd be happy to write out some minimal code for you if you can provide a sample data frame!

lawrencelmli · September 17, 2020, 7:34pm

Hi both,

Thanks for taking the time to reply. Here's a very simplified version of the dataframe:
> dput(df_merged) structure(list(blood_loss.x = c(0, 100, 150, NA, NA, NA), blood_loss.y = c(NA, NA, NA, 200, 10, 200), hospital_stay.x = c(1, 2, 3, NA, NA, NA ), hospital_stay.y = c(NA, NA, NA, 5, 8, 2)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

I discovered that I can use the coalesce() function to join the columns the way I want. However, I was wondering if there is a way I can use apply or write a function to find all the matching columns that end with .x, and .y and pass the entire dataframe through to coalesce them.

olibravo · September 18, 2020, 10:53am

You can always tidyr::gather(), then tidyr::separate an finally tidyr::spread(). Gathering gives you the long version of a table where original colnames are now a variable, splitting splits the new column into two: original feature name and the variant .x or .y. Finally, you spread the table by the variant.

df_merged <- structure(list(blood_loss.x = c(0, 100, 150, NA, NA, NA), blood_loss.y = c(NA, NA, NA, 200, 10, 200), hospital_stay.x = c(1, 2, 3, NA, NA, NA ), hospital_stay.y = c(NA, NA, NA, 5, 8, 2)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))


df_merged <- df_merged %>% 
  tibble::rowid_to_column()

df_merged %>% 
  tidyr::gather("key", "value", -rowid) %>% 
  tidyr::separate("key", into = c("feature", "variant"), sep = "\\.") %>% 
  tidyr::spread(variant, value) %>% 
  mutate(final_value = coalesce(x, y))

Is it what You're looking for?

nirgrahamuk · September 18, 2020, 11:15am

Here's an option

library(tidyverse)

(df_merged <- (structure(list(
  blood_loss.x = c(0, 100, 150, NA, NA, NA),
  blood_loss.y = c(NA, NA, NA, 200, 10, 200),
  hospital_stay.x = c(1, 2, 3, NA, NA, NA),
  hospital_stay.y = c(NA, NA, NA, 5, 8, 2)), 
  row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))))


xn <- names(df_merged)[endsWith(names(df_merged),".x")]

nms <- str_replace(string = xn,
                   pattern = ".x",
                   replacement="")
map_dfc(nms,
     ~ coalesce(df_merged[[paste0(.,".x")]],
                df_merged[[paste0(.,".y")]]
     )) %>% setNames(nms)

# # A tibble: 6 x 2
# blood_loss hospital_stay
# <dbl>         <dbl>
# 1          0             1
# 2        100             2
# 3        150             3
# 4        200             5
# 5         10             8
# 6        200             2

lawrencelmli · September 18, 2020, 8:06pm

Hi,

That's amazing. As someone who is quite new to R, can I check if I have understood this correctly?

So, you identified all the column names with and assigned this to xn (I did not know that you can subset within one line of code like this!).

Then the nms was used to remove all the ".x" to be used as the new column names.

Then map_dfc is used to apply the coalesce() function over all the columns that I wish to coalesce. I am not too familiar with the purr package and have a couple more questions:

Why use map_dfc as oppose to map_df?
With regards to the syntax of map_dfc, why is the first argument nms and not the actual data?
I have columns within the data that is unique and not duplicated with .x. and .y, e.g.
id = c(0001, 0002, 0003, 0004, 0005, 0006), if I use the solution, these will be removed, would the best solution be to do a cbind and then simply remove all the columns with the .x. and .y names?

nirgrahamuk · September 18, 2020, 9:29pm

map_df is when your function constructs a dataframe, and I chose not to. I chose to column bind the vectors into a dataframe with dfc variant.

Its iterating over each nms , i.e. each variable can be picked from the dataframe and coalesced, in turn, the column bind the results

you can just cbind (I would probably reach for bind_cols ) the coalesced variables in this dataframe with the names not in nms in the original dataframe.

system · September 25, 2020, 9:29pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.