A data cleaning process I often find myself doing is replacing missing or incorrect data in one dataset using another reference dataset. I feel like I'm missing some well established methods or tools around this practice.
I do something like this reprex:
library(tidyverse)
#make some bad data
bad_cars <- mtcars %>%
rownames_to_column(var = "brand_model") %>%
select(brand_model, cyl, gear, carb) %>%
mutate(cyl = ifelse(cyl>4, cyl-1, cyl), #making some bad data
carb = ifelse(carb<2, NA, carb)) %>% #making some bad data
as.tibble()
#reference data
good_cars <- mtcars %>%
rownames_to_column(var = "brand_model") %>%
select(brand_model, cyl, gear, carb) %>%
as.tibble()
#function to correct bad data with reference data using a 'mutating' join
mutate_join <- function(x, y, by, var){
var = enquo(var)
var_ref = sym(str_c(quo_name(var), "_ref"))
x %>%
left_join(select(y, by, !!var), by = by, suffix = c("", "_ref")) %>%
mutate(!!quo_name(var) := ifelse(!!var == !!var_ref, !!var, !!var_ref)) %>% #different data
mutate(!!quo_name(var) := ifelse(is.na(!!var), !!var_ref, !!var)) %>% #missing data
select(-matches("_ref")) #remove ref col
}
#demonstrate
corrected_cars <-
bad_cars %>%
mutate_join(good_cars, by = "brand_model", cyl) %>%
mutate_join(good_cars, by = "brand_model", carb)
corrected_cars
#confirm
identical(corrected_cars, good_cars)
This could be done a lot better I'm sure. Are there any good practices/packages people can recommend around this process of selectively correcting a df based on a reference df?
Thanks for the feedback. And, well, when you put it that way maybe I don't have as much of a problem as I think.
In practice the reason I do this is I get data from a lot of different regions all over the country who have locally collected data which is usually best, but occasionally missing or wrong on a few select fields' values that another group has data on. So I'll define some logic around that depending on the type/scenario etc...
But I think you're right, I'm probably overthinking this (odd never done that before) and could probably replace whole fields to same effect more often.
Thanks again!
Cheers,
Ben
PS Huge fan of your episode on data.framed, really appreciated it.
I keep telling my wife I'm a big deal online. She's unimpressed at best. I'm sending her a link to this thread as proof.
I run into this when I have say county level data and I want logic that does something like:
if county data outside of some range then use district data
if the district value is outside of some tests then use state data
I always end up joining district data & state data to my county data then doing a somewhat ugly case_when type of logic statement. I too feel like this is sort of ugly and seems like it should be a function. But I run into this about once a year and have just not been motivated to really fix it. It also seems like every time I have this situation the logic around when I want to replace values is different. That makes it less than trivial to abstract into a function.