What you can do is shown in the following reproducible example, called a reprex. (Note that BCount is missing because I pasted the code from your first block.)

The reprex below fixes the mixed-case problem, but it doesn't further your goal of eliminating duplicated values of B with group_by() and distinct(). While the rows may have identical B and D, they vary in E. What's the decision rule to choose between the earlier and later dates?

You can, however, use nest() to keep both values of E and decide later how you want to unpack them.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(stringr)
library(tidyr)
B <- c("10.1056/NEJMOA1505467", "10.1056/NEJMoa1505467", "10.1056/nejmoa1508375", "10.1056/NEJMOA1508375")
D <- c("Paywall", "Paywall", "Paywall", "Paywall")
E <- c(2015, 2012, 2010, 2011)
DF <- data.frame(B, D, E)
DF
#>                       B       D    E
#> 1 10.1056/NEJMOA1505467 Paywall 2015
#> 2 10.1056/NEJMoa1505467 Paywall 2012
#> 3 10.1056/nejmoa1508375 Paywall 2010
#> 4 10.1056/NEJMOA1508375 Paywall 2011
DF_upper <- DF %>% mutate(B = str_to_upper(B))
DF_upper
#>                       B       D    E
#> 1 10.1056/NEJMOA1505467 Paywall 2015
#> 2 10.1056/NEJMOA1505467 Paywall 2012
#> 3 10.1056/NEJMOA1508375 Paywall 2010
#> 4 10.1056/NEJMOA1508375 Paywall 2011
DF_nest <- DF_upper %>% group_by(B) %>% nest()
DF_nest
#> # A tibble: 2 x 2
#> # Groups: B [2]
#> B data
#> <chr> <list<df[,2]>>
#> 1 10.1056/NEJMOA1505467 [2 × 2]
#> 2 10.1056/NEJMOA1508375 [2 × 2]
DF_nest$data
#> <list_of<
#> tbl_df<
#> D: factor<6edeb>
#> E: double
#> >
#> >[2]>
#> [[1]]
#> # A tibble: 2 x 2
#> D E
#> <fct> <dbl>
#> 1 Paywall 2015
#> 2 Paywall 2012
#>
#> [[2]]
#> # A tibble: 2 x 2
#> D E
#> <fct> <dbl>
#> 1 Paywall 2010
#> 2 Paywall 2011
Created on 2019-11-15 by the reprex package (v0.3.0)
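
Once you settle on a decision rule, the last step could look something like the sketch below. It assumes, purely for illustration, that you want to keep the latest value of E for each B; that rule is my guess, not something your question states, so swap max() for min() (or whatever fits) as needed. The names n_after_distinct, DF_latest and DF_unnested are made up for this example. The sketch also shows that distinct() on its own leaves all four rows in place, and that the nested column can be unpacked again with unnest().

```r
library(dplyr)
library(stringr)
library(tidyr)

# Same toy data as in the reprex
DF <- data.frame(
  B = c("10.1056/NEJMOA1505467", "10.1056/NEJMoa1505467",
        "10.1056/nejmoa1508375", "10.1056/NEJMOA1508375"),
  D = c("Paywall", "Paywall", "Paywall", "Paywall"),
  E = c(2015, 2012, 2010, 2011)
)

# Normalise case so the same DOI always has the same spelling
DF_upper <- DF %>% mutate(B = str_to_upper(B))

# distinct() alone does not collapse anything: E differs within each B,
# so every row is still unique (4 rows)
n_after_distinct <- DF_upper %>% distinct() %>% nrow()

# ASSUMED rule: keep the row(s) with the largest E for each B
DF_latest <- DF_upper %>%
  group_by(B) %>%
  filter(E == max(E)) %>%
  ungroup()

# Alternatively, nest first and unpack later; unnest() restores all rows
DF_nest <- DF_upper %>% group_by(B) %>% nest()
DF_unnested <- DF_nest %>% unnest(cols = c(data)) %>% ungroup()
```

With this rule DF_latest has one row per B, while DF_unnested has the original four rows back. Note that filter(E == max(E)) keeps every tied row if two rows share the same maximum; in dplyr 1.0.0 or later, slice_max(E, n = 1, with_ties = FALSE) is an alternative that always returns exactly one row per group.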