Using distinct but ignoring case

technocrat · November 16, 2019, 1:03am

What you can do is shown in the following reproducible example, called a reprex.

(Note that BCount is missing because I only pasted the code from your first block.)

The reprex below fixes the mixed-case problem, but it doesn't further your goal of eliminating duplicated values of B with group_by and distinct. Although rows may have identical B and D, they differ in E. What's the decision rule for choosing between the earlier and later dates?

You can, however, nest the data, which keeps both values of E, and decide later how you want to unpack them.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)
library(tidyr)
B = c("10.1056/NEJMOA1505467", "10.1056/NEJMoa1505467", "10.1056/nejmoa1508375", "10.1056/NEJMOA1508375")
D = c("Paywall", "Paywall", "Paywall", "Paywall")
E = c(2015, 2012, 2010, 2011)
DF = data.frame(B, D, E)
DF
#>                       B       D    E
#> 1 10.1056/NEJMOA1505467 Paywall 2015
#> 2 10.1056/NEJMoa1505467 Paywall 2012
#> 3 10.1056/nejmoa1508375 Paywall 2010
#> 4 10.1056/NEJMOA1508375 Paywall 2011
DF_upper <- DF %>% mutate(B = str_to_upper(B))
DF_upper
#>                       B       D    E
#> 1 10.1056/NEJMOA1505467 Paywall 2015
#> 2 10.1056/NEJMOA1505467 Paywall 2012
#> 3 10.1056/NEJMOA1508375 Paywall 2010
#> 4 10.1056/NEJMOA1508375 Paywall 2011
DF_nest <- DF_upper %>% group_by(B) %>% nest()
DF_nest
#> # A tibble: 2 x 2
#> # Groups:   B [2]
#>   B                               data
#>   <chr>                 <list<df[,2]>>
#> 1 10.1056/NEJMOA1505467        [2 × 2]
#> 2 10.1056/NEJMOA1508375        [2 × 2]
DF_nest$data
#> <list_of<
#>   tbl_df<
#>     D: factor<6edeb>
#>     E: double
#>   >
#> >[2]>
#> [[1]]
#> # A tibble: 2 x 2
#>   D           E
#>   <fct>   <dbl>
#> 1 Paywall  2015
#> 2 Paywall  2012
#> 
#> [[2]]
#> # A tibble: 2 x 2
#>   D           E
#>   <fct>   <dbl>
#> 1 Paywall  2010
#> 2 Paywall  2011

Created on 2019-11-15 by the reprex package (v0.3.0)
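
Once you settle on a decision rule, collapsing to one row per B is short. As a minimal sketch, suppose the rule were "keep the row with the latest E". That rule is only an assumption for illustration, since your question doesn't state one, and DF_latest is a made-up name. Sorting first and then calling distinct with .keep_all = TRUE keeps the first row it sees for each B:

```r
library(dplyr)
library(stringr)

# Same toy data as in the reprex above
DF <- data.frame(
  B = c("10.1056/NEJMOA1505467", "10.1056/NEJMoa1505467",
        "10.1056/nejmoa1508375", "10.1056/NEJMOA1508375"),
  D = c("Paywall", "Paywall", "Paywall", "Paywall"),
  E = c(2015, 2012, 2010, 2011)
)

# Assumed rule (not from the original question): keep the latest E per B
DF_latest <- DF %>%
  mutate(B = str_to_upper(B)) %>%   # normalise case so duplicates match
  arrange(desc(E)) %>%              # latest E first
  distinct(B, .keep_all = TRUE)     # keep the first row seen for each B
DF_latest
```

If the rule should instead favour the earliest date, arrange(E) in place of arrange(desc(E)) would do it. If you want to keep both values for now, the nested version above leaves that choice open.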