I ran one fuzzy match pass on my data (obtaining names_match
) but there are some mismatches that I arranged, subsetted out and corrected with another iteration of fuzzy matching (correct
).
Now the trouble I have is trying to append the correctly matched names back to my overall dataset, by comparing the name and their group. I am unsure how to get case_when
to work by passing entire columns/vectors into it, instead of having to name each case.
Here is some reprex data:
# Output obtained from first fuzzy match, with some mistakes
names_match <- tibble(group = c(rep(1, 3), rep(2, 3), rep(3, 4), rep(4, 3)),
match_label = c("Andy", "Angelica", "Aaron",
"Betty", "Ben", "Barack",
"Chrissy", "Celeste", "Cameron", "Candace",
"Danny", "Dane", "Devline"),
matched_label = c("Angelica P.", "Andy A.", "Aaron J.",
"Barack O.", "Betty F.", "Ben T.",
"Celeste J.", "Chrissy L.", "Cameron W.", "Candace O.",
"Danny S.", "Dane D.", "Devline J.")
)
names_match # mismatched rows: 1, 2, 4, 5, 6, 7, 8
# A tibble: 13 x 3
group to_match matched_label
<dbl> <chr> <chr>
1 1 Andy Angelica P.
2 1 Angelica Andy A.
3 1 Aaron Aaron J.
4 2 Betty Barack O.
5 2 Ben Betty F.
6 2 Barack Ben T.
7 3 Chrissy Celeste J.
8 3 Celeste Chrissy L.
9 3 Cameron Cameron W.
10 3 Candace Candace O.
11 4 Danny Danny S.
12 4 Dane Dane D.
13 4 Devline Devline J.
# Subset top_n rows with mismatches
# then rematched correctly
correct <- tibble(group = c(rep(1, 3), rep(2, 3), rep(3, 2)),
match_label = c("Andy", "Angelica", "Aaron",
"Betty", "Ben", "Barack",
"Chrissy", "Celeste"),
correct_label = c("Andy A.", "Angelica P.", "Aaron J.",
"Betty F.", "Ben T.", "Barack O.",
"Chrissy L.", "Celeste J.")
)
correct
# A tibble: 8 x 3
group to_match correct_label
<dbl> <chr> <chr>
1 1 Andy Andy A.
2 1 Angelica Angelica P.
3 1 Aaron Aaron J.
4 2 Betty Betty F.
5 2 Ben Ben T.
6 2 Barack Barack O.
7 3 Chrissy Chrissy L.
8 3 Celeste Celeste J.
A 'dumb' method I have only works when the row order is exactly the same, which kinda defeats the purpose of a case_when
.
# This only works if the order is exactly the same
named_vec <- deframe(tibble(correct$to_match, correct$correct_label))
named_vec
names_match %>%
slice(1:8) %>%
mutate(new_match = case_when(to_match == named_vec ~ named_vec,
TRUE ~ named_vec)
) %>%
# Append the rest of the df back
add_row(names_match %>% slice(9:n())) %>%
# Coalesce the two columns, keeping new name if not NA
mutate(matched_label = ifelse(is.na(new_match) == FALSE,
new_match,
matched_label)) %>%
select(-new_match)
# A tibble: 13 x 3
group to_match matched_label
<dbl> <chr> <chr>
1 1 Andy Andy A.
2 1 Angelica Angelica P.
3 1 Aaron Aaron J.
4 2 Betty Betty F.
5 2 Ben Ben T.
6 2 Barack Barack O.
7 3 Chrissy Chrissy L.
8 3 Celeste Celeste J.
9 3 Cameron Cameron W.
10 3 Candace Candace O.
11 4 Danny Danny S.
12 4 Dane Dane D.
13 4 Devline Devline J.
I am aware this can also be done with a join, but i'm trying to find out how to work with case_when
so i'll stick to this rather contrived example