Using distinct but ignoring case

tmauch · November 16, 2019, 12:25am

I have a data frame, DF, with column B, where the values in B are a mix of numbers and letters and some other punctuation. Sometimes the letters are lower case, upper case, or both.

B = c("10.1056/NEJMOA1505467", "10.1056/NEJMoa1505467", "10.1056/nejmoa1508375", "10.1056/NEJMOA1508375")
D = c("Paywall", "Paywall", "Paywall", "Paywall")
E = c(2015, 2012, 2010, 2011)
DF = data.frame(B, D, E)
DF

                      B       D    E
1 10.1056/NEJMOA1505467 Paywall 2015
2 10.1056/NEJMoa1505467 Paywall 2012
3 10.1056/nejmoa1508375 Paywall 2010
4 10.1056/NEJMOA1508375 Paywall 2011

I'm trying to identify duplicate values in B by using group_by and mutate and then get rid of rows with duplicate values in B using distinct. But because the cases aren't the same, group_by and distinct don't count them as being the same.


DF <- DF %>% 
  group_by(B) %>% 
  mutate(BCount = n())

DF

 
# A tibble: 4 x 4
# Groups:   B [4]
  B                     D           E BCount
  <fct>                 <fct>   <dbl>  <int>
1 10.1056/NEJMOA1505467 Paywall  2015      1
2 10.1056/NEJMoa1505467 Paywall  2012      1
3 10.1056/nejmoa1508375 Paywall  2010      1
4 10.1056/NEJMOA1508375 Paywall  2011      1


DF <- distinct(DF, B, .keep_all = TRUE)
DF

# A tibble: 4 x 4
# Groups:   B [4]
  B                     D           E BCount
  <fct>                 <fct>   <dbl>  <int>
1 10.1056/NEJMOA1505467 Paywall  2015      1
2 10.1056/NEJMoa1505467 Paywall  2012      1
3 10.1056/nejmoa1508375 Paywall  2010      1
4 10.1056/NEJMOA1508375 Paywall  2011      1

I've tried using tolower and toupper to get the letters all the same, but this doesn't seem to change the values in my data frame, it seems to create a new vector? I want to keep my data frame and just convert the text (one way or the other, it doesn't matter). What am I getting wrong?

technocrat · November 16, 2019, 1:03am

What you can do is shown in the following reproducible example, called a reprex

(Note that BCount is missing, because I pasted the code in your first block).

This reprex below fixes the mixed case problem, but it doesn't further your goal of eliminating duplicated values of B with group_by and distinct. While the rows may have identical B and D, they vary in E. What's the decision rule to choose between the earlier and later dates?

You can, however, nest, keep both values of E and decide later how you want to unpack them.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)
library(tidyr)
B = c("10.1056/NEJMOA1505467", "10.1056/NEJMoa1505467", "10.1056/nejmoa1508375", "10.1056/NEJMOA1508375")
D = c("Paywall", "Paywall", "Paywall", "Paywall")
E = c(2015, 2012, 2010, 2011)
DF = data.frame(B, D, E)
DF
#>                       B       D    E
#> 1 10.1056/NEJMOA1505467 Paywall 2015
#> 2 10.1056/NEJMoa1505467 Paywall 2012
#> 3 10.1056/nejmoa1508375 Paywall 2010
#> 4 10.1056/NEJMOA1508375 Paywall 2011
DF_upper <- DF %>% mutate(B = str_to_upper(B))
DF_upper
#>                       B       D    E
#> 1 10.1056/NEJMOA1505467 Paywall 2015
#> 2 10.1056/NEJMOA1505467 Paywall 2012
#> 3 10.1056/NEJMOA1508375 Paywall 2010
#> 4 10.1056/NEJMOA1508375 Paywall 2011
DF_nest <- DF_upper %>% group_by(B) %>% nest()
DF_nest
#> # A tibble: 2 x 2
#> # Groups:   B [2]
#>   B                               data
#>   <chr>                 <list<df[,2]>>
#> 1 10.1056/NEJMOA1505467        [2 × 2]
#> 2 10.1056/NEJMOA1508375        [2 × 2]
DF_nest$data
#> <list_of<
#>   tbl_df<
#>     D: factor<6edeb>
#>     E: double
#>   >
#> >[2]>
#> [[1]]
#> # A tibble: 2 x 2
#>   D           E
#>   <fct>   <dbl>
#> 1 Paywall  2015
#> 2 Paywall  2012
#> 
#> [[2]]
#> # A tibble: 2 x 2
#>   D           E
#>   <fct>   <dbl>
#> 1 Paywall  2010
#> 2 Paywall  2011

^{Created on 2019-11-15 by the reprex package (v0.3.0)}

Pritishd · November 16, 2019, 5:49am

Hi tmauch,

Below code should help you. Let me know if you face any issues.

I first change the case to upper using the toupper function
Next, I use the function duplicated to identify duplicates. Remember that duplicated identifies the latest duplicate instant in a set of values. For e.g. In case of the first 2 values from DF the function will identify the second duplicate which corresponds to year 2012. You can reverse it by setting the argument fromLast=TRUE in the code below
I finally remove these duplicates using filter function and remove the two helper columns using select function

B = c("10.1056/NEJMOA1505467", "10.1056/NEJMoa1505467", "10.1056/nejmoa1508375", "10.1056/NEJMOA1508375")
D = c("Paywall", "Paywall", "Paywall", "Paywall")
E = c(2015, 2012, 2010, 2011)
DF = data.frame(B, D, E)
DF

library(dplyr)

DF<-mutate(DF,
             G=toupper(B),
             "dupl_row"=duplicated(G,fromLast=FALSE)) %>%
    filter(dupl_row!=TRUE) %>%
    select(-G,-dupl_row)

print(DF)

Below is how the output of this code looks like:

Case :: A (Setting fromLast=FALSE)

                      B       D    E
1 10.1056/NEJMOA1505467 Paywall 2015
2 10.1056/nejmoa1508375 Paywall 2010

Case :: B (Setting fromLast=TRUE)

                     B       D    E
1 10.1056/NEJMoa1505467 Paywall 2012
2 10.1056/NEJMOA1508375 Paywall 2011

Warm Regards,
Pritish

system · December 7, 2019, 5:49am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.