How does left_join, dplyr self-join with filter work ?

Hi All,
My question is inspired by this post:

I would like to understand how this two dataframes (one of them is a subset from the other) can be joined together if a first dataframe has got 15 rows:

vas = data_frame(
  user = rep(letters[1:n_users], each = 3),
  group = rep(c("baseline", "early", "late" ),n_users),
  vas = round(rgamma(n_users*3, 10,1.4 ))

and the other has got 5 rows:

vas_baseline <- vas_1 %>% filter(group == "baseline")


vas_1 = vas %>%
  mutate(vas = vas * 2)

vas_with_diff <-  vas_1 %>%
  left_join(vas_baseline, by = c("user")) %>%
  mutate(vas_difference = vas.x - vas.y)

We finally got a new dataframe with 15 rows and 6 columns (no NAs - surprisingly to me?):

I apologize fort a naive question, but would it be possible to see where all cells (values) from
"vas_baseline" are situated now in "vas_with_diff" ? Sort of like maybe conditional formatting in an Excel way.
What would have happened if "vas_baseline" had more columns than "vas_1" ?


its no na's because you only joined 'by user' rather than by user & group. So you get a value wherever there is a user match, which is everywhere

The generation of NA values as a result of a join is dependent on the joining keys, not the number of rows in the data frames being joined.

In the example, vas_1 and vas_baseline are being left joined using only the user variable. So, the variables in the right data frame (vas_baseline) will be joined to vas_1 wherever a match is found for user. Since both data frames contain all 5 possibilities (a to e), there are no NA values in the result.

If instead you added group to the join criteria, then you would get NA values, since the vas_baseline data frame contains only one group type.

left_join(vas_1, vas_baseline, by = c("user", "group"))
# A tibble: 15 x 4
   user  group    vas.x vas.y
   <chr> <chr>    <dbl> <dbl>
 1 a     baseline    14    14
 2 a     early       14    NA
 3 a     late        22    NA
 4 b     baseline    12    12
 5 b     early       10    NA
 6 b     late        18    NA
 7 c     baseline    22    22
 8 c     early       14    NA
 9 c     late        16    NA
10 d     baseline    16    16
11 d     early       10    NA
12 d     late        10    NA
13 e     baseline    16    16
14 e     early       16    NA
15 e     late        26    NA

I don't think there is a way to highlight matched rows (if I understand correctly what you mean). The rows which returned NA for the variables in the right data frame are the unmatched rows.

Thank you All for your reply,
I have tried in Excel and below there is a result.
I hope that is correct:

If this is not correct, please let me know.
Or even this:

I wish I could do it in R.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.