Mutating/Replacing joins anybody?


A data cleaning process I often find myself doing is replacing missing or incorrect data in one dataset using another reference dataset. I feel like I'm missing some well established methods or tools around this practice.

I do something like this reprex:


#make some bad data
bad_cars <- mtcars %>% 
  rownames_to_column(var = "brand_model") %>% 
  select(brand_model, cyl, gear, carb) %>% 
  mutate(cyl = ifelse(cyl>4, cyl-1, cyl), #making some bad data
         carb = ifelse(carb<2, NA, carb)) %>%  #making some bad data

#reference data
good_cars <- mtcars %>% 
  rownames_to_column(var = "brand_model") %>% 
  select(brand_model, cyl, gear, carb) %>% 

#function to correct bad data with reference data using a 'mutating' join 
mutate_join <- function(x, y, by, var){
  var = enquo(var)
  var_ref = sym(str_c(quo_name(var), "_ref"))
  x %>% 
    left_join(select(y, by, !!var), by = by, suffix = c("", "_ref")) %>% 
    mutate(!!quo_name(var) := ifelse(!!var == !!var_ref, !!var, !!var_ref)) %>% #different data
    mutate(!!quo_name(var) := ifelse(!!var), !!var_ref, !!var)) %>% #missing data
    select(-matches("_ref")) #remove ref col

corrected_cars <- 
  bad_cars %>% 
    mutate_join(good_cars, by = "brand_model", cyl) %>% 
    mutate_join(good_cars, by = "brand_model", carb)


identical(corrected_cars, good_cars)

This could be done a lot better I'm sure. Are there any good practices/packages people can recommend around this process of selectively correcting a df based on a reference df?


great reprex!

My first thought is: that works, so not sure there's a reason to mess with it :slight_smile:

My second thought is, if you don't trust the data in bad_cars and will override it with what's in good_cars then why even bring it in at all? Why not:

bad_cars %>%
  select(brand_model, gear) %>%
  left_join(select(good_cars, brand_model, cyl, carb))

Maybe in practice your mutate_join function has more complicated logic to determine if a value is bad?

1 Like

Hi JD,

Thanks for the feedback. And, well, when you put it that way maybe I don't have as much of a problem as I think. :slight_smile:

In practice the reason I do this is I get data from a lot of different regions all over the country who have locally collected data which is usually best, but occasionally missing or wrong on a few select fields' values that another group has data on. So I'll define some logic around that depending on the type/scenario etc...

But I think you're right, I'm probably overthinking this (odd never done that before) and could probably replace whole fields to same effect more often.

Thanks again!


PS Huge fan of your episode on data.framed, really appreciated it.

1 Like

I keep telling my wife I'm a big deal online. She's unimpressed at best. I'm sending her a link to this thread as proof.

I run into this when I have say county level data and I want logic that does something like:

  • if county data outside of some range then use district data
  • if the district value is outside of some tests then use state data

I always end up joining district data & state data to my county data then doing a somewhat ugly case_when type of logic statement. I too feel like this is sort of ugly and seems like it should be a function. But I run into this about once a year and have just not been motivated to really fix it. It also seems like every time I have this situation the logic around when I want to replace values is different. That makes it less than trivial to abstract into a function.

1 Like