I have a data frame, DF, with column B, where the values in B are a mix of numbers and letters and some other punctuation. Sometimes the letters are lower case, upper case, or both.
B = c("10.1056/NEJMOA1505467", "10.1056/NEJMoa1505467", "10.1056/nejmoa1508375", "10.1056/NEJMOA1508375")
D = c("Paywall", "Paywall", "Paywall", "Paywall")
E = c(2015, 2012, 2010, 2011)
DF = data.frame(B, D, E)
DF
B D E
1 10.1056/NEJMOA1505467 Paywall 2015
2 10.1056/NEJMoa1505467 Paywall 2012
3 10.1056/nejmoa1508375 Paywall 2010
4 10.1056/NEJMOA1508375 Paywall 2011
I'm trying to identify duplicate values in B by using group_by and mutate and then get rid of rows with duplicate values in B using distinct. But because the cases aren't the same, group_by and distinct don't count them as being the same.
DF <- DF %>%
group_by(B) %>%
mutate(BCount = n())
DF
# A tibble: 4 x 4
# Groups: B [4]
B D E BCount
<fct> <fct> <dbl> <int>
1 10.1056/NEJMOA1505467 Paywall 2015 1
2 10.1056/NEJMoa1505467 Paywall 2012 1
3 10.1056/nejmoa1508375 Paywall 2010 1
4 10.1056/NEJMOA1508375 Paywall 2011 1
DF <- distinct(DF, B, .keep_all = TRUE)
DF
# A tibble: 4 x 4
# Groups: B [4]
B D E BCount
<fct> <fct> <dbl> <int>
1 10.1056/NEJMOA1505467 Paywall 2015 1
2 10.1056/NEJMoa1505467 Paywall 2012 1
3 10.1056/nejmoa1508375 Paywall 2010 1
4 10.1056/NEJMOA1508375 Paywall 2011 1
I've tried using tolower and toupper to get the letters all the same, but this doesn't seem to change the values in my data frame, it seems to create a new vector? I want to keep my data frame and just convert the text (one way or the other, it doesn't matter). What am I getting wrong?
(Note that BCount is missing, because I pasted the code in your first block).
This reprex below fixes the mixed case problem, but it doesn't further your goal of eliminating duplicated values of B with group_by and distinct. While the rows may have identical B and D, they vary in E. What's the decision rule to choose between the earlier and later dates?
You can, however, nest, keep both values of E and decide later how you want to unpack them.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(stringr)
library(tidyr)
B = c("10.1056/NEJMOA1505467", "10.1056/NEJMoa1505467", "10.1056/nejmoa1508375", "10.1056/NEJMOA1508375")
D = c("Paywall", "Paywall", "Paywall", "Paywall")
E = c(2015, 2012, 2010, 2011)
DF = data.frame(B, D, E)
DF
#> B D E
#> 1 10.1056/NEJMOA1505467 Paywall 2015
#> 2 10.1056/NEJMoa1505467 Paywall 2012
#> 3 10.1056/nejmoa1508375 Paywall 2010
#> 4 10.1056/NEJMOA1508375 Paywall 2011
DF_upper <- DF %>% mutate(B = str_to_upper(B))
DF_upper
#> B D E
#> 1 10.1056/NEJMOA1505467 Paywall 2015
#> 2 10.1056/NEJMOA1505467 Paywall 2012
#> 3 10.1056/NEJMOA1508375 Paywall 2010
#> 4 10.1056/NEJMOA1508375 Paywall 2011
DF_nest <- DF_upper %>% group_by(B) %>% nest()
DF_nest
#> # A tibble: 2 x 2
#> # Groups: B [2]
#> B data
#> <chr> <list<df[,2]>>
#> 1 10.1056/NEJMOA1505467 [2 × 2]
#> 2 10.1056/NEJMOA1508375 [2 × 2]
DF_nest$data
#> <list_of<
#> tbl_df<
#> D: factor<6edeb>
#> E: double
#> >
#> >[2]>
#> [[1]]
#> # A tibble: 2 x 2
#> D E
#> <fct> <dbl>
#> 1 Paywall 2015
#> 2 Paywall 2012
#>
#> [[2]]
#> # A tibble: 2 x 2
#> D E
#> <fct> <dbl>
#> 1 Paywall 2010
#> 2 Paywall 2011
Below code should help you. Let me know if you face any issues.
I first change the case to upper using the toupper function
Next, I use the function duplicated to identify duplicates. Remember that duplicated identifies the latest duplicate instant in a set of values. For e.g. In case of the first 2 values from DF the function will identify the second duplicate which corresponds to year 2012. You can reverse it by setting the argument fromLast=TRUE in the code below
I finally remove these duplicates using filter function and remove the two helper columns using select function