Automatic correction of wrong data entry issue

Hi, I hope this is useful for people dealing with incorrect data entry, where most rows for a key variable agree with each other but some don't.
In this sample, the car registration number (RegNo) is the key and the model name (Model) may have been entered incorrectly.

data.source <- data.frame(
  stringsAsFactors = FALSE,
             Model = c("AAA","AAA","AAA","AAA",
                       "BBB","CCC","CCC","DDD","CCC","CCC","EEE","FFF",
                       "GGG","GGG"),
             RegNo = c("A16XX","A16XX","A16XX",
                       "A16XX","A16XX","141LHXX","141LHXX","141LHXX","141LHXX",
                       "141LHXX","172XX","172XX","172XX","172XX")
)

data.source

I can compute the proportion of each Model within each RegNo to find the majority entry, assuming an entry is correct if its Model.prop is >= 0.5.

library(dplyr)

# Count rows per RegNo and per (Model, RegNo) pair, then take the ratio
result <- data.source |>
  add_count(RegNo, name = "reg.count") |>
  add_count(Model, RegNo, name = "model.count") |>
  mutate(Model.prop = model.count / reg.count) |>
  arrange(Model, RegNo)

result

Now, within each RegNo, is it possible to replace any Model name whose Model.prop is < 0.5 with the Model name whose Model.prop is >= 0.5?
In this example, BBB should be changed to AAA, DDD to CCC, and EEE and FFF to GGG.

I am sure we could do that in R but I don't know how...

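# For each RegNo, find the model with the highest share of rows.
# Note: this picks the most common model and does not check the 0.5 threshold.
(result_2 <- group_by(result, RegNo) |>
  summarise(
    maxprop = max(Model.prop),
    likelymodel = unique(Model[Model.prop == maxprop])
  ))

# Attach the likely model to every row and keep the original entry for reference
(result_3 <- left_join(result, result_2, by = "RegNo") |>
  rename(
    oldmodelcol = Model,
    Model = likelymodel
  ) |>
  relocate(Model))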
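
For anyone who wants the correction in a single pass, here is a minimal sketch of the same idea that also applies the 0.5 threshold from the question. The helper most_common() and the column names top.prop and Model.fixed are made up for illustration, and ties are resolved by taking the alphabetically first model, which is an assumption and not something the original post specifies.

```r
library(dplyr)

data.source <- data.frame(
  Model = c("AAA", "AAA", "AAA", "AAA", "BBB",
            "CCC", "CCC", "DDD", "CCC", "CCC",
            "EEE", "FFF", "GGG", "GGG"),
  RegNo = c(rep("A16XX", 5), rep("141LHXX", 5), rep("172XX", 4)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value in x.
# table() sorts names alphabetically, so ties go to the first name (assumption).
most_common <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

corrected <- data.source |>
  group_by(RegNo) |>
  mutate(
    # Share of rows in this RegNo held by its most frequent Model
    top.prop = max(table(Model)) / n(),
    # Overwrite only when the majority model reaches the 0.5 threshold;
    # otherwise leave the entries as they were typed
    Model.fixed = if (top.prop[1] >= 0.5) most_common(Model) else Model
  ) |>
  ungroup()

corrected
```

The original Model column is left untouched so the changes can be reviewed before overwriting anything.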