Duplicates with conditions

Slavek · October 16, 2019, 12:10pm

Hi,
I'm trying to subset unique ReferenceNumbers where Dates and Decstriptions are the same.

I have prepared this simple example

library(dplyr)

My.data <- data.frame(stringsAsFactors=FALSE,
   ReferenceNumber = c("xyz", "xyz", "abc", "abc", "abc", "abc", "abc", "abc"),
              Date = c("2019-03-22", "2019-03-23", "2017-11-29", "2017-11-29",
                       "2018-01-11", "2018-01-12", "2018-11-27", "2018-11-27"),
       Description = c("bla bla", "bla bla", "aaa", "aaa", "bbb", "bbb", "ccc",
                       "ccc")
)
My.data

I know how to remove duplicates based on one variable:

My.data.New <- My.data[!duplicated(My.data$ReferenceNumber), ]
My.data.New

...but I would like to remove only records with same RerferenceNumber, same Date and same Description (so as a result I should have 6 records).

Can you help please?

Matthias · October 16, 2019, 12:17pm

There are probably more elegant ways to do this, but an easy solution that comes into my mind is combining all 3 columns to 1, e.g. with paste() or str_c(), and checking this for duplicates.

Slavek · October 16, 2019, 12:42pm

I'm also adding one more variable (Cost):


My.data <- 
  data.frame(stringsAsFactors=FALSE,
             ReferenceNumber = c("xyz", "xyz", "abc", "abc", "abc", "abc", "abc", "abc"),
             Date = c("2019-03-22", "2019-03-23", "2017-11-29", "2017-11-29",
                      "2018-01-11", "2018-01-12", "2018-11-27",
                      "2018-11-27"),
             Description = c("bla bla", "bla bla", "aaa", "aaa", "bbb", "bbb", "ccc",
                             "ccc"),
             Cost = c(25, 26, 26, 5, 12, 14, 14, 14)
  )

and I want this value be averaged for ReferenceNumber before removing duplicates so value for abc in row 4 and 4 should be 15.5 before removing one of duplicates, the same relates to last two rows where average should be 14.
How can I do that?

jim89 · October 16, 2019, 1:02pm

When you're thinking about duplicates and want just the unique set of values back, you can use dplyr::distinct() to do so:

library(dplyr)
distinct(My.data, ReferenceNumber, Date, Description)

And a trick to remember is, when you want a summary value/statistic for different combinations of other values in your data, what you're really talking about is group_by(), for example, to calculate the average cost for each combination of ReferenceNumber, Date, and Description in your data you could use:

My.data %>% 
    group_by(ReferenceNumber, Date, Description) %>% 
    summarise(Cost = mean(Cost))

# A tibble: 6 x 4
# Groups:   ReferenceNumber, Date [6]
  ReferenceNumber Date       Description  Cost
  <chr>           <chr>      <chr>       <dbl>
1 abc             2017-11-29 aaa          15.5
2 abc             2018-01-11 bbb          12  
3 abc             2018-01-12 bbb          14  
4 abc             2018-11-27 ccc          14  
5 xyz             2019-03-22 bla bla      25  
6 xyz             2019-03-23 bla bla      26

If you want the average for a different set of variables, change what you put in group_by(), or if you want to add an average as a new, repeated variable (rather than summarising and reducing the number of rows in your data), you can use mutate() after group_by(), rather than summarise():

My.data %>% 
    group_by(ReferenceNumber, Date, Description) %>% 
    mutate(Cost = mean(Cost))

# A tibble: 8 x 4
# Groups:   ReferenceNumber, Date, Description [6]
  ReferenceNumber Date       Description  Cost
  <chr>           <chr>      <chr>       <dbl>
1 xyz             2019-03-22 bla bla      25  
2 xyz             2019-03-23 bla bla      26  
3 abc             2017-11-29 aaa          15.5
4 abc             2017-11-29 aaa          15.5
5 abc             2018-01-11 bbb          12  
6 abc             2018-01-12 bbb          14  
7 abc             2018-11-27 ccc          14  
8 abc             2018-11-27 ccc          14

Slavek · October 16, 2019, 1:49pm

Thank you very much for this full explanation of all possible options.
Excellent!!!

Slavek · October 16, 2019, 2:20pm

I have one more question if you don't mind.
Both solutions removing duplicates above (group_by and distinct) create outputs only with variables mentioned in their codes.
My real data frame is significantly bigger.
I am adding two more variables to the data frame:

My.data2 <- 
  data.frame(stringsAsFactors=FALSE,
             ReferenceNumber = c("xyz", "xyz", "abc", "abc", "abc", "abc", "abc", "abc"),
             Date = c("2019-03-22", "2019-03-23", "2017-11-29", "2017-11-29",
                      "2018-01-11", "2018-01-12", "2018-11-27",
                      "2018-11-27"),
             Description = c("bla bla", "bla bla", "aaa", "aaa", "bbb", "bbb", "ccc",
                             "ccc"),
             Cost = c(25, 26, 26, 5, 12, 14, 14, 14),
             Cost2 = c(225, 6, 6, 51, 2, 4, 1, 14),
             NewVar = c(2, 6, 6, 5, 1, 1, 4, 4)
  )

When I run your codes on My.data2, all results I see are without "NewVar" and "Cost2". Is any way to keep them so koo keep the structure of the original data frame?

jim89 · October 16, 2019, 3:31pm

Add them to distinct() or group_by() so that they get included in your de-deduplication. They're look like they're also duplicated so it probably makes sense to do so here.

But in general if you have duplicates across just some of the variables in your data (and not others) you'll need to think/plan how you want to deal with the others, which depends on your data and I'm not sure I can give you a generic "do X" answer for that, as it's very dependant on your data and what you're trying to do with them!

Slavek · October 16, 2019, 3:48pm

Let's say the "NewVar" will be used to link this data frame with something else or will be used for further analysis, tabulations etc. I need to keep it and I don't know how I could do that...

jim89 · October 16, 2019, 3:50pm

Is NewVar also duplicated along with ReferenceNumber, Date etc? I'm going to guess that if you need it to link to another data frame, then it probably is also duplicated, but it's not clear from your question.

If it is duplicated, you can either include it in distinct() or group_by() to include it in your de-deduplication. If not, you need to think about if it makes sense to summarise NewVar and code accordingly.

Slavek · October 16, 2019, 4:02pm

Absolutely perfect! Thank you

Slavek · October 17, 2019, 1:22pm

Hi,
I know you have kindly fully answered my question but I hope you don't mind if I ask you one more thing.
Means work perfectly for Numerical variables but is it possible to do something similar with string variables?
Is it possible to keep original string value if they are the same and new (for example "Unsure") if they are different for duplicated records?

My.data <- 
  data.frame(stringsAsFactors=FALSE,
             ReferenceNumber = c("xyz", "xyz", "abc", "abc", "abc", "abc", "abc", "abc"),
             Date = c("2019-03-22", "2019-03-23", "2017-11-29", "2017-11-29",
                      "2018-01-11", "2018-01-12", "2018-11-27",
                      "2018-11-27"),
             Description = c("bla bla", "bla bla", "aaa", "aaa", "bbb", "bbb", "ccc",
                             "ccc"),
             Cost = c("High", "Low", "High", "High","High", "Low", "High", "Low"),
             Cost2 = c(225, 6, 6, 51, 2, 4, 1, 14),
             NewVar = c(2, 6, 6, 5, 1, 1, 4, 4)
  )
My.data

My.data.New <- My.data %>% 
  group_by(ReferenceNumber, Date, Description) %>% 
  summarise(Cost = "Unsure")
My.data.New

Obviously my solution is not working but I don't know if this operation is possible at all?
As a result, rows 3 and 4 should have "High", last two rows "Unsure" and other descriptions should stay unchanged...

jim89 · October 18, 2019, 6:19pm

There's not as many pre-defined summary functions for text/string data (e.g. mean(), median() etc don't make sense). But you could write your own custom summary function if you have some specific behaviour you want to use, e.g.:

library(tidyverse)

My.data <- data.frame(
    stringsAsFactors=FALSE,
    ReferenceNumber = c("xyz", "xyz", "abc", "abc", "abc", "abc", "abc", "abc"),
    Date = c("2019-03-22", "2019-03-23", "2017-11-29", "2017-11-29", "2018-01-11", "2018-01-12", "2018-11-27","2018-11-27"),
    Description = c("bla bla", "bla bla", "aaa", "aaa", "bbb", "bbb", "ccc", "ccc"),
    Cost = c("High", "Low", "High", "High","High", "Low", "High", "Low"),
    Cost2 = c(225, 6, 6, 51, 2, 4, 1, 14),
    NewVar = c(2, 6, 6, 5, 1, 1, 4, 4)
)

custom_string_summary <- function(string_vec) {
    ifelse(length(unique(string_vec)) == 1, unique(string_vec), "Unsure")
}

My.data %>% 
  group_by(ReferenceNumber, Date, Description) %>% 
  summarise(Cost = custom_string_summary(Cost))
#> # A tibble: 6 x 4
#> # Groups:   ReferenceNumber, Date [6]
#>   ReferenceNumber Date       Description Cost  
#>   <chr>           <chr>      <chr>       <chr> 
#> 1 abc             2017-11-29 aaa         High  
#> 2 abc             2018-01-11 bbb         High  
#> 3 abc             2018-01-12 bbb         Low   
#> 4 abc             2018-11-27 ccc         Unsure
#> 5 xyz             2019-03-22 bla bla     High  
#> 6 xyz             2019-03-23 bla bla     Low

^{Created on 2019-10-18 by the reprex package (v0.3.0)}

Here, the custom summary function says "if there's only one distinct value, use that value, otherwise say we're unsure what the value is". You could write any summary function you wanted to, really, or use paste() with the collapse argument set to "roll up" all the character values in to a single item.

Slavek · October 22, 2019, 9:21am

Absolutely brilliant!!! Thank you

jim89 · October 22, 2019, 5:58pm

No worries.

Because I didn't mention it before, I don't know if this is actually a very good idea, though. Yes you can do it, but it's letting you ignore/get away with not resolving the duplication in your data, which I don't think is a good idea.

In this case it might be absolutely fine (I don't know the full context of your data/analysis), but in general I'd caution against using a convenient solution to a problem like this (which doesn't have a particularly well defined solution), rather than figuring out where those duplicates are coming from in the first place, and figuring out how to treat them more "properly".

Slavek · October 23, 2019, 8:04am

Thank you for your note. I'll do some testing based on a bigger sample and see if that works

jim89 · October 24, 2019, 7:25pm

Ok. Just to be clear though, my point isn't about whether or not this method will work on larger data, my point is that you need to be careful about ignoring/hiding duplicates using summaries like this; there's a bit of a risk that it will come back to bite you later in the project!

system · October 31, 2019, 7:25pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.