Hi, I hope this might be useful for people dealing with incorrect data entry, where most rows for a given key value agree but some don't.
I have this sample with a car registration number (the key info) and a model name (which might be incorrect):
data.source <- data.frame(
  stringsAsFactors = FALSE,
  Model = c("AAA", "AAA", "AAA", "AAA",
            "BBB", "CCC", "CCC", "DDD", "CCC", "CCC", "EEE", "FFF",
            "GGG", "GGG"),
  RegNo = c("A16XX", "A16XX", "A16XX",
            "A16XX", "A16XX", "141LHXX", "141LHXX", "141LHXX", "141LHXX",
            "141LHXX", "172XX", "172XX", "172XX", "172XX")
)
data.source
I can look at the proportion of each Model within a RegNo to find the majority entry, assuming a Model is correct if its Model.prop is >= 0.5:
library(dplyr)
# Count rows per RegNo and per Model/RegNo pair,
# then compute each Model's share within its RegNo
result <- data.source %>%
  add_count(RegNo, name = "reg.count") %>%
  add_count(Model, RegNo, name = "model.count") %>%
  mutate(Model.prop = model.count / reg.count) %>%
  arrange(Model, RegNo)
result
Now, is it possible to replace every Model whose Model.prop is < 0.5 with the Model whose Model.prop is >= 0.5 within the same RegNo?
In this example, BBB should be changed to AAA, DDD to CCC, and EEE and FFF to GGG.
I am sure this can be done in R, but I don't know how...
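One possible direction, as a minimal sketch rather than a definitive answer: for each RegNo, pick the most frequent Model and use it as the replacement when its share is at least 0.5. The column names top.model, top.prop and Model.fixed are made up for illustration, and the sketch assumes ties are broken by taking whichever Model appears first (for example, a RegNo split exactly 50/50 between two models).

```r
library(dplyr)

# Sample data from the post
data.source <- data.frame(
  stringsAsFactors = FALSE,
  Model = c("AAA", "AAA", "AAA", "AAA",
            "BBB", "CCC", "CCC", "DDD", "CCC", "CCC", "EEE", "FFF",
            "GGG", "GGG"),
  RegNo = c("A16XX", "A16XX", "A16XX",
            "A16XX", "A16XX", "141LHXX", "141LHXX", "141LHXX", "141LHXX",
            "141LHXX", "172XX", "172XX", "172XX", "172XX")
)

fixed <- data.source %>%
  # How often each Model occurs within its RegNo
  add_count(RegNo, Model, name = "model.count") %>%
  group_by(RegNo) %>%
  mutate(
    # Most frequent Model in this RegNo (first one wins on ties; an assumption)
    top.model = Model[which.max(model.count)],
    # Share of rows in this RegNo that carry the most frequent Model
    top.prop = max(model.count) / n(),
    # Replace only when the majority Model reaches the 0.5 threshold;
    # otherwise keep the original entry untouched
    Model.fixed = if_else(top.prop >= 0.5, top.model, Model)
  ) %>%
  ungroup() %>%
  select(RegNo, Model, Model.fixed)

fixed
```

On the sample data this turns BBB into AAA, DDD into CCC, and EEE and FFF into GGG. If no Model reaches 0.5 within a RegNo, the original values are kept, which seems safer than guessing.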