Duplicates with conditions

Hi,
I'm trying to subset unique ReferenceNumbers where Dates and Decstriptions are the same.

I have prepared this simple example

library(dplyr)

My.data <- data.frame(stringsAsFactors=FALSE,
   ReferenceNumber = c("xyz", "xyz", "abc", "abc", "abc", "abc", "abc", "abc"),
              Date = c("2019-03-22", "2019-03-23", "2017-11-29", "2017-11-29",
                       "2018-01-11", "2018-01-12", "2018-11-27", "2018-11-27"),
       Description = c("bla bla", "bla bla", "aaa", "aaa", "bbb", "bbb", "ccc",
                       "ccc")
)
My.data

I know how to remove duplicates based on one variable:

My.data.New <- My.data[!duplicated(My.data$ReferenceNumber), ]
My.data.New

...but I would like to remove only records with same RerferenceNumber, same Date and same Description (so as a result I should have 6 records).

Can you help please?

There are probably more elegant ways to do this, but an easy solution that comes into my mind is combining all 3 columns to 1, e.g. with paste() or str_c(), and checking this for duplicates.

I'm also adding one more variable (Cost):


My.data <- 
  data.frame(stringsAsFactors=FALSE,
             ReferenceNumber = c("xyz", "xyz", "abc", "abc", "abc", "abc", "abc", "abc"),
             Date = c("2019-03-22", "2019-03-23", "2017-11-29", "2017-11-29",
                      "2018-01-11", "2018-01-12", "2018-11-27",
                      "2018-11-27"),
             Description = c("bla bla", "bla bla", "aaa", "aaa", "bbb", "bbb", "ccc",
                             "ccc"),
             Cost = c(25, 26, 26, 5, 12, 14, 14, 14)
  )

and I want this value be averaged for ReferenceNumber before removing duplicates so value for abc in row 4 and 4 should be 15.5 before removing one of duplicates, the same relates to last two rows where average should be 14.
How can I do that?

When you're thinking about duplicates and want just the unique set of values back, you can use dplyr::distinct() to do so:

library(dplyr)
distinct(My.data, ReferenceNumber, Date, Description)

And a trick to remember is, when you want a summary value/statistic for different combinations of other values in your data, what you're really talking about is group_by(), for example, to calculate the average cost for each combination of ReferenceNumber, Date, and Description in your data you could use:

My.data %>% 
    group_by(ReferenceNumber, Date, Description) %>% 
    summarise(Cost = mean(Cost))

# A tibble: 6 x 4
# Groups:   ReferenceNumber, Date [6]
  ReferenceNumber Date       Description  Cost
  <chr>           <chr>      <chr>       <dbl>
1 abc             2017-11-29 aaa          15.5
2 abc             2018-01-11 bbb          12  
3 abc             2018-01-12 bbb          14  
4 abc             2018-11-27 ccc          14  
5 xyz             2019-03-22 bla bla      25  
6 xyz             2019-03-23 bla bla      26  

If you want the average for a different set of variables, change what you put in group_by(), or if you want to add an average as a new, repeated variable (rather than summarising and reducing the number of rows in your data), you can use mutate() after group_by(), rather than summarise():

My.data %>% 
    group_by(ReferenceNumber, Date, Description) %>% 
    mutate(Cost = mean(Cost))

# A tibble: 8 x 4
# Groups:   ReferenceNumber, Date, Description [6]
  ReferenceNumber Date       Description  Cost
  <chr>           <chr>      <chr>       <dbl>
1 xyz             2019-03-22 bla bla      25  
2 xyz             2019-03-23 bla bla      26  
3 abc             2017-11-29 aaa          15.5
4 abc             2017-11-29 aaa          15.5
5 abc             2018-01-11 bbb          12  
6 abc             2018-01-12 bbb          14  
7 abc             2018-11-27 ccc          14  
8 abc             2018-11-27 ccc          14  
1 Like

Thank you very much for this full explanation of all possible options.
Excellent!!! :clap:

I have one more question if you don't mind.
Both solutions removing duplicates above (group_by and distinct) create outputs only with variables mentioned in their codes.
My real data frame is significantly bigger.
I am adding two more variables to the data frame:

My.data2 <- 
  data.frame(stringsAsFactors=FALSE,
             ReferenceNumber = c("xyz", "xyz", "abc", "abc", "abc", "abc", "abc", "abc"),
             Date = c("2019-03-22", "2019-03-23", "2017-11-29", "2017-11-29",
                      "2018-01-11", "2018-01-12", "2018-11-27",
                      "2018-11-27"),
             Description = c("bla bla", "bla bla", "aaa", "aaa", "bbb", "bbb", "ccc",
                             "ccc"),
             Cost = c(25, 26, 26, 5, 12, 14, 14, 14),
             Cost2 = c(225, 6, 6, 51, 2, 4, 1, 14),
             NewVar = c(2, 6, 6, 5, 1, 1, 4, 4)
  )

When I run your codes on My.data2, all results I see are without "NewVar" and "Cost2". Is any way to keep them so koo keep the structure of the original data frame?

Add them to distinct() or group_by() so that they get included in your de-deduplication. They're look like they're also duplicated so it probably makes sense to do so here.

But in general if you have duplicates across just some of the variables in your data (and not others) you'll need to think/plan how you want to deal with the others, which depends on your data and I'm not sure I can give you a generic "do X" answer for that, as it's very dependant on your data and what you're trying to do with them!

Let's say the "NewVar" will be used to link this data frame with something else or will be used for further analysis, tabulations etc. I need to keep it and I don't know how I could do that...

Is NewVar also duplicated along with ReferenceNumber, Date etc? I'm going to guess that if you need it to link to another data frame, then it probably is also duplicated, but it's not clear from your question.

If it is duplicated, you can either include it in distinct() or group_by() to include it in your de-deduplication. If not, you need to think about if it makes sense to summarise NewVar and code accordingly.

1 Like

Absolutely perfect! Thank you :smile:

Hi,
I know you have kindly fully answered my question but I hope you don't mind if I ask you one more thing.
Means work perfectly for Numerical variables but is it possible to do something similar with string variables?
Is it possible to keep original string value if they are the same and new (for example "Unsure") if they are different for duplicated records?

My.data <- 
  data.frame(stringsAsFactors=FALSE,
             ReferenceNumber = c("xyz", "xyz", "abc", "abc", "abc", "abc", "abc", "abc"),
             Date = c("2019-03-22", "2019-03-23", "2017-11-29", "2017-11-29",
                      "2018-01-11", "2018-01-12", "2018-11-27",
                      "2018-11-27"),
             Description = c("bla bla", "bla bla", "aaa", "aaa", "bbb", "bbb", "ccc",
                             "ccc"),
             Cost = c("High", "Low", "High", "High","High", "Low", "High", "Low"),
             Cost2 = c(225, 6, 6, 51, 2, 4, 1, 14),
             NewVar = c(2, 6, 6, 5, 1, 1, 4, 4)
  )
My.data

My.data.New <- My.data %>% 
  group_by(ReferenceNumber, Date, Description) %>% 
  summarise(Cost = "Unsure")
My.data.New

Obviously my solution is not working but I don't know if this operation is possible at all?
As a result, rows 3 and 4 should have "High", last two rows "Unsure" and other descriptions should stay unchanged...

There's not as many pre-defined summary functions for text/string data (e.g. mean(), median() etc don't make sense). But you could write your own custom summary function if you have some specific behaviour you want to use, e.g.:

library(tidyverse)

My.data <- data.frame(
    stringsAsFactors=FALSE,
    ReferenceNumber = c("xyz", "xyz", "abc", "abc", "abc", "abc", "abc", "abc"),
    Date = c("2019-03-22", "2019-03-23", "2017-11-29", "2017-11-29", "2018-01-11", "2018-01-12", "2018-11-27","2018-11-27"),
    Description = c("bla bla", "bla bla", "aaa", "aaa", "bbb", "bbb", "ccc", "ccc"),
    Cost = c("High", "Low", "High", "High","High", "Low", "High", "Low"),
    Cost2 = c(225, 6, 6, 51, 2, 4, 1, 14),
    NewVar = c(2, 6, 6, 5, 1, 1, 4, 4)
)

custom_string_summary <- function(string_vec) {
    ifelse(length(unique(string_vec)) == 1, unique(string_vec), "Unsure")
}

My.data %>% 
  group_by(ReferenceNumber, Date, Description) %>% 
  summarise(Cost = custom_string_summary(Cost))
#> # A tibble: 6 x 4
#> # Groups:   ReferenceNumber, Date [6]
#>   ReferenceNumber Date       Description Cost  
#>   <chr>           <chr>      <chr>       <chr> 
#> 1 abc             2017-11-29 aaa         High  
#> 2 abc             2018-01-11 bbb         High  
#> 3 abc             2018-01-12 bbb         Low   
#> 4 abc             2018-11-27 ccc         Unsure
#> 5 xyz             2019-03-22 bla bla     High  
#> 6 xyz             2019-03-23 bla bla     Low

Created on 2019-10-18 by the reprex package (v0.3.0)

Here, the custom summary function says "if there's only one distinct value, use that value, otherwise say we're unsure what the value is". You could write any summary function you wanted to, really, or use paste() with the collapse argument set to "roll up" all the character values in to a single item.

2 Likes

Absolutely brilliant!!! Thank you :smile:

No worries.

Because I didn't mention it before, I don't know if this is actually a very good idea, though. Yes you can do it, but it's letting you ignore/get away with not resolving the duplication in your data, which I don't think is a good idea.

In this case it might be absolutely fine (I don't know the full context of your data/analysis), but in general I'd caution against using a convenient solution to a problem like this (which doesn't have a particularly well defined solution), rather than figuring out where those duplicates are coming from in the first place, and figuring out how to treat them more "properly".

Thank you for your note. I'll do some testing based on a bigger sample and see if that works :smile:

Ok. Just to be clear though, my point isn't about whether or not this method will work on larger data, my point is that you need to be careful about ignoring/hiding duplicates using summaries like this; there's a bit of a risk that it will come back to bite you later in the project!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.