Deduplication with some conditions

Slavek · November 17, 2020, 4:11pm

Hi,
I have a challenge with deduplication.
I have a file with four respondents aaa, bbb, ccc and ddd.

Respondent aaa has all variables the same apart from two text fields (SERVICE_DESCRIPTION, BOOKING_NOTES).
Respondent bbb has almost all variables the same apart from numerical values
Respondent ccc has two identical records
Respondent ddd has only one response

This is the file:

data.source <- data.frame(
     stringsAsFactors = FALSE,
                  URN = c("aaa", "aaa", "bbb", "bbb", "ccc", "ccc", "ddd"),
     RegistrationDate = c("2016-07-20",
                          "2016-07-20","2014-12-31","2014-12-31","2017-12-28",
                          "2017-12-28","2018-12-28"),
        InterviewDate = c("2019-01-03 08:31:00",
                          "2019-01-03 08:31:00","2019-01-05 20:04:00",
                          "2019-01-05 20:04:00","2019-01-11 19:59:00",
                          "2019-01-11 19:59:00","2019-01-11 19:59:00"),
  SERVICE_DESCRIPTION = c("Carry out  something",
                          "Check for uneven wear","bla bla","bla bla","clock",
                          "clock","xxx"),
        BOOKING_NOTES = c("detail 1",
                          "wear at rear","parts","parts","issue","issue","yyy"),
     Duration_Minutes = c(0, 0, 132, 132, 0, 0, 0),
       JOB_START_TIME = c("1900-01-01 00:00:00",
                          "1900-01-01 00:00:00","2018-11-22 11:00:00",
                          "2018-12-17 11:00:00","1900-01-01 00:00:00",
                          "1900-01-01 00:00:00","2018-01-01 00:00:00"),
         JOB_END_TIME = c("1900-01-01 00:00:00",
                          "1900-01-01 00:00:00","2018-11-22 15:30:00",
                          "2018-12-17 15:30:00","1900-01-01 00:00:00",
                          "1900-01-01 00:00:00","2018-01-01 00:00:00"),
                Model = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC", "CCC"),
              Mileage = c(35722, 35722, 15399, 15399, 8037, 8037, 8037),
                 YEAR = c(2016, 2016, 2014, 2014, 2017, 2017, 2017),
              ROTotal = c(503.35, 503.35, 164.91, 0, 189.88, 189.88, 189.88),
           LabourSale = c(203.5, 203.5, 116.6, 0, 114.3, 114.3, 114.3)
)

Now, I would like to remove all duplicates with the following conditions:

If just numerical variables are different (bbb), use averages of these values
If comment variables are different (SERVICE_DESCRIPTION, BOOKING_NOTES) and numerical values are different (aaa) , merge texts from comment variables and calculate average score from numerical variables
If everything is the same (ccc) , keep one record
Keep other records unchanged (ddd)

I've done this so far:

library(tidyverse)
data.dedup <- data.source %>% 
  group_by(URN,	RegistrationDate,	InterviewDate,	SERVICE_DESCRIPTION,	BOOKING_NOTES,	JOB_START_TIME,	JOB_END_TIME,	Model) %>% 
  summarise(Duration_Minutes = mean(Duration_Minutes), Mileage = mean(Mileage), YEAR = mean(YEAR), ROTotal = mean(ROTotal), LabourSale = mean(LabourSale))


library(tidyverse)
data.dedup2 <- data.source %>% 
  group_by(URN,	RegistrationDate,	InterviewDate,	JOB_START_TIME,	JOB_END_TIME,	Model) %>% 
  summarise(Duration_Minutes = mean(Duration_Minutes), Mileage = mean(Mileage), YEAR = mean(YEAR), ROTotal = mean(ROTotal), LabourSale = mean(LabourSale))

but I need a complex solution.
As a result, I need four rows of data.

Can you help?

AlexisW · November 17, 2020, 5:34pm

I think your problem becomes simpler if you only group by the URN, then apply a function to each sub-data.frame (for each URN value). Something along these lines:

collapse_entries <- function(subdf){
  subdf %>%
    summarize(across(where(is.numeric),
                     mean),
              across(4:5, ~paste(unique(.x), collapse="/")))
}

data.source %>%
  group_by(URN) %>%
  nest() %>%
  mutate(collapsed_data = map(data, collapse_entries)) %>%
  select(-data) %>%
  unnest(collapsed_data)

Does that help?

Slavek · November 17, 2020, 5:50pm

Wow! Excellent, although I need to keep remaining variables as all of them will be used later. Perhaps I could merge the result of your manipulation with the original file?
Also, how can I repeat merging BOOKING_NOTES (beautifully done) to the SERVICE_DESCRIPTION (not done yet)? Should across be related to the 4th and 5th variable? Only the 4th was merged...

AlexisW · November 17, 2020, 6:01pm

Oh, no that was a typo from me: I first tried the summarize on a filtered df (so, with the URN column), then copy-pasted in the function; but of course the subdf in the function doesn't have the URN column. One clear reason not to use column numbers to select columns!

collapse_entries <- function(subdf){
  subdf %>%
    summarize(across(where(is.numeric),
                     mean),
              across(3:4, ~paste(unique(.x), collapse="/")))
}

That's one possibility, another is to keep them in the summary(), I just wasn't sure I understood exactly what you where trying to do. So this should keep RegistrationDate without modification:

collapse_entries <- function(subdf){
  subdf %>%
    summarize(across(where(is.numeric),
                     mean),
              across(3:4, ~paste(unique(.x), collapse="/")),
              RegistrationDate)
}

Of course, you can use across("RegistrationDate", unique) or something like that to collapse the identical dates.

Slavek · November 18, 2020, 12:07pm

Thank you very much.
I can see that using this code in my large data frame takes ages:

collapse_entries <- function(subdf){
  subdf %>%
    summarize(across(where(is.numeric),
                     mean),
              across(3:4, ~paste(unique(.x), collapse="/")),
              across("RegistrationDate", unique),
              across("InterviewDate", unique),
    )
}
dedup.data <- data.source %>%
  group_by(URN) %>%
  nest() %>%
  mutate(collapsed_data = map(data, collapse_entries)) %>%
  select(-data) %>%
  unnest(collapsed_data)

I think that adding missing variables by merging should be much quicker. Nevertheless, whichever type of merging I'm using I always have all 7 records rather than just 4. I have tried all options but the result is always the same: 7 rows


collapse_entries <- function(subdf){
  subdf %>%
    summarize(across(where(is.numeric),
                     mean),
              across(3:4, ~paste(unique(.x), collapse="/"))
                            )
}
dedup.data <- data.source %>%
  group_by(URN) %>%
  nest() %>%
  mutate(collapsed_data = map(data, collapse_entries)) %>%
  select(-data) %>%
  unnest(collapsed_data)

merged.data <- merge(data.source, dedup.data, by="URN",all = FALSE,
                     sort = TRUE, no.dups = TRUE)
merged.data

What am I doing wrong?

AlexisW · November 18, 2020, 2:34pm

I don't understand what you're trying to do with the merge? Isn't dedup.data the result you are aiming for?

Slavek · November 18, 2020, 4:08pm

Yes, I need dedup.data but including all variables.
I think there are two ways but merging dedup.data with the original data could add missing variables easily...

AlexisW · November 18, 2020, 4:19pm

Alright, then you just need to remove the columns that have been deduplicated, since they haven't been deduplicated in the data.source, the merge would keep all rows.

source.no.dup <- select(data.source,
             -c(where(is.numeric),"SERVICE_DESCRIPTION","BOOKING_NOTES","RegistrationDate","InterviewDate", "JOB_START_TIME","JOB_END_TIME")) %>%
  distinct()

merge(dedup.data, source.no.dup,
      by="URN", all=FALSE)

Note that I also had to remove JOB_START_TIME and JOB_END_TIME since they differ between the two bbbrows.

Slavek · November 18, 2020, 4:23pm

Thank you very much master!

system · November 25, 2020, 4:23pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.