I'm wondering if there is a more efficient way of doing the following: I have a data frame with N rows, but only M of those rows are unique. I want to generate a new data frame with a uniqueID variable and the corresponding count of rows. I can do this as follows:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(9782)
test.df <- data.frame(A = sample(LETTERS, 10000, TRUE),
                      B = sample(letters, 10000, TRUE),
                      C = sample(1:5, 10000, TRUE))
test.df %>%
  mutate(uniqueID = paste0(A, B, C)) %>%
  group_by(uniqueID) %>%
  summarise(n = n()) %>%
  arrange(-n)
#> # A tibble: 3,206 x 2
#> uniqueID n
#> <chr> <int>
#> 1 Jg2 11
#> 2 Vz3 10
#> 3 Ao2 9
#> 4 Aq2 9
#> 5 Cv3 9
#> 6 Ee5 9
#> 7 Fj5 9
#> 8 Jw4 9
#> 9 Mk3 9
#> 10 Po1 9
#> # ... with 3,196 more rows
There are probably multiple ways to do it, but here is one of them:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(9782)
test.df <- data.frame(A = sample(LETTERS, 10000, TRUE),
                      B = sample(letters, 10000, TRUE),
                      C = sample(1:5, 10000, TRUE))
test.df %>%
  mutate(uniqueID = paste0(!!!rlang::syms(names(test.df)))) %>%
  add_count()
#> # A tibble: 10,000 x 5
#> A B C uniqueID n
#> <fct> <fct> <int> <chr> <int>
#> 1 X x 1 Xx1 10000
#> 2 T x 1 Tx1 10000
#> 3 U s 1 Us1 10000
#> 4 Z j 5 Zj5 10000
#> 5 H l 2 Hl2 10000
#> 6 M r 1 Mr1 10000
#> 7 X z 3 Xz3 10000
#> 8 W j 2 Wj2 10000
#> 9 E k 4 Ek4 10000
#> 10 A y 5 Ay5 10000
#> # … with 9,990 more rows
Created on 2019-02-11 by the reprex package (v0.2.1)
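Note that add_count() is called above without any variable, so n is simply the total number of rows (10000) on every row. To get the count per ID, pass the ID column: add_count(uniqueID). You can then filter out all the rows with n > 1 to keep only the combinations that occur once.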
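A minimal sketch of that per-ID version, assuming the intent was to count rows per uniqueID and then keep the combinations that appear only once. The names with_counts and singletons are just illustrative.

```r
library(dplyr)

set.seed(9782)
test.df <- data.frame(A = sample(LETTERS, 10000, TRUE),
                      B = sample(letters, 10000, TRUE),
                      C = sample(1:5, 10000, TRUE))

# add_count(uniqueID) keeps all 10,000 rows and adds n,
# the number of rows that share the same uniqueID
with_counts <- test.df %>%
  mutate(uniqueID = paste0(A, B, C)) %>%
  add_count(uniqueID)

# keep only the combinations that occur exactly once
singletons <- with_counts %>%
  filter(n == 1)
```

Unlike summarise() or count(), add_count() does not collapse the data, so the original columns stay available for further filtering.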
Anytime you see code with the sequence group_by() followed by summarise(n = n()), you can use count() instead. It does this sequence for you, has a built-in argument named sort that takes care of the arrange() step, and also does the ungroup() step, which you don't show but would need before doing any further operations on this tibble.
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.2
set.seed(9782)
test.df <- data.frame(A = sample(LETTERS, 10000, TRUE),
                      B = sample(letters, 10000, TRUE),
                      C = sample(1:5, 10000, TRUE))
test.df %>%
  count(A, B, C, sort = TRUE)
#> # A tibble: 3,206 x 4
#> A B C n
#> <fct> <fct> <int> <int>
#> 1 J g 2 11
#> 2 V z 3 10
#> 3 A o 2 9
#> 4 A q 2 9
#> 5 C v 3 9
#> 6 E e 5 9
#> 7 F j 5 9
#> 8 J w 4 9
#> 9 M k 3 9
#> 10 P o 1 9
#> # … with 3,196 more rows
# the name argument currently is only possible in the development version of dplyr
# devtools::install_github("tidyverse/dplyr@rc_0.8.0")
test.df %>%
  count(A, B, C, sort = TRUE, name = "unique_count")
#> # A tibble: 3,206 x 4
#> A B C unique_count
#> <fct> <fct> <int> <int>
#> 1 J g 2 11
#> 2 V z 3 10
#> 3 A o 2 9
#> 4 A q 2 9
#> 5 C v 3 9
#> 6 E e 5 9
#> 7 F j 5 9
#> 8 J w 4 9
#> 9 M k 3 9
#> 10 P o 1 9
#> # … with 3,196 more rows
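If a single uniqueID column is still wanted, as in the original question, one option is to build it after counting. This is only a sketch: the name id_counts is illustrative, and it assumes that pasting A, B and C together without a separator identifies each combination unambiguously, which holds here because each column contributes exactly one character.

```r
library(dplyr)

set.seed(9782)
test.df <- data.frame(A = sample(LETTERS, 10000, TRUE),
                      B = sample(letters, 10000, TRUE),
                      C = sample(1:5, 10000, TRUE))

id_counts <- test.df %>%
  count(A, B, C, sort = TRUE) %>%        # one row per distinct (A, B, C)
  mutate(uniqueID = paste0(A, B, C)) %>% # build the ID on the collapsed rows
  select(uniqueID, n)
```

Building the ID after count() means paste0() runs once per distinct combination rather than once per original row.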