Hi,
I have a challenge with deduplication.
I have a file with four respondents aaa, bbb, ccc and ddd.
- Respondent aaa has all variables the same apart from two text fields (SERVICE_DESCRIPTION, BOOKING_NOTES).
- Respondent bbb has almost all variables the same apart from numerical values
- Respondent ccc has two identical records
- Respondent ddd has only one response
This is the file:
data.source <- data.frame(
stringsAsFactors = FALSE,
URN = c("aaa", "aaa", "bbb", "bbb", "ccc", "ccc", "ddd"),
RegistrationDate = c("2016-07-20",
"2016-07-20","2014-12-31","2014-12-31","2017-12-28",
"2017-12-28","2018-12-28"),
InterviewDate = c("2019-01-03 08:31:00",
"2019-01-03 08:31:00","2019-01-05 20:04:00",
"2019-01-05 20:04:00","2019-01-11 19:59:00",
"2019-01-11 19:59:00","2019-01-11 19:59:00"),
SERVICE_DESCRIPTION = c("Carry out something",
"Check for uneven wear","bla bla","bla bla","clock",
"clock","xxx"),
BOOKING_NOTES = c("detail 1",
"wear at rear","parts","parts","issue","issue","yyy"),
Duration_Minutes = c(0, 0, 132, 132, 0, 0, 0),
JOB_START_TIME = c("1900-01-01 00:00:00",
"1900-01-01 00:00:00","2018-11-22 11:00:00",
"2018-12-17 11:00:00","1900-01-01 00:00:00",
"1900-01-01 00:00:00","2018-01-01 00:00:00"),
JOB_END_TIME = c("1900-01-01 00:00:00",
"1900-01-01 00:00:00","2018-11-22 15:30:00",
"2018-12-17 15:30:00","1900-01-01 00:00:00",
"1900-01-01 00:00:00","2018-01-01 00:00:00"),
Model = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC", "CCC"),
Mileage = c(35722, 35722, 15399, 15399, 8037, 8037, 8037),
YEAR = c(2016, 2016, 2014, 2014, 2017, 2017, 2017),
ROTotal = c(503.35, 503.35, 164.91, 0, 189.88, 189.88, 189.88),
LabourSale = c(203.5, 203.5, 116.6, 0, 114.3, 114.3, 114.3)
)
Now, I would like to remove all duplicates with the following conditions:
- If just numerical variables are different (bbb), use averages of these values
- If comment variables are different (SERVICE_DESCRIPTION, BOOKING_NOTES) and numerical values are different (aaa) , merge texts from comment variables and calculate average score from numerical variables
- If everything is the same (ccc) , keep one record
- Keep other records unchanged (ddd)
I've done this so far:
library(tidyverse)
data.dedup <- data.source %>%
group_by(URN, RegistrationDate, InterviewDate, SERVICE_DESCRIPTION, BOOKING_NOTES, JOB_START_TIME, JOB_END_TIME, Model) %>%
summarise(Duration_Minutes = mean(Duration_Minutes), Mileage = mean(Mileage), YEAR = mean(YEAR), ROTotal = mean(ROTotal), LabourSale = mean(LabourSale))
library(tidyverse)
data.dedup2 <- data.source %>%
group_by(URN, RegistrationDate, InterviewDate, JOB_START_TIME, JOB_END_TIME, Model) %>%
summarise(Duration_Minutes = mean(Duration_Minutes), Mileage = mean(Mileage), YEAR = mean(YEAR), ROTotal = mean(ROTotal), LabourSale = mean(LabourSale))
but I need a complex solution.
As a result, I need four rows of data.
Can you help?