Hi all:
Using the tidyverse, how might I randomly sample without replacement the iris
data frame, grouped by Species
, so that I end up with 50 unique groups of 3 where each species is represented?
In other words, what's the best approach to repeat the chunk below 50 times without replacement so that each combination of three is unique?
iris %>%
group_by(Species) %>%
sample_n(1)
davis
August 15, 2018, 4:45pm
2
There might be a better way, but my first thought was stratified vfold crossvalidation using rsample
, where you only care about the assessment set.
library(dplyr)
library(rsample)
library(purrr)
library(tidyr)
ex <- vfold_cv(iris, v = 50, strata = "Species")
# Extract only the assessment set
ex$resamples <- map(ex$splits, assessment)
ex
#> # 50-fold cross-validation using stratification
#> # A tibble: 50 x 3
#> splits id resamples
#> <list> <chr> <list>
#> 1 <S3: rsplit> Fold01 <data.frame [3 x 5]>
#> 2 <S3: rsplit> Fold02 <data.frame [3 x 5]>
#> 3 <S3: rsplit> Fold03 <data.frame [3 x 5]>
#> 4 <S3: rsplit> Fold04 <data.frame [3 x 5]>
#> 5 <S3: rsplit> Fold05 <data.frame [3 x 5]>
#> 6 <S3: rsplit> Fold06 <data.frame [3 x 5]>
#> 7 <S3: rsplit> Fold07 <data.frame [3 x 5]>
#> 8 <S3: rsplit> Fold08 <data.frame [3 x 5]>
#> 9 <S3: rsplit> Fold09 <data.frame [3 x 5]>
#> 10 <S3: rsplit> Fold10 <data.frame [3 x 5]>
#> # ... with 40 more rows
unnest(ex, resamples)
#> # A tibble: 150 x 6
#> id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <chr> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 Fold01 5 3 1.6 0.2 setosa
#> 2 Fold01 5.5 2.4 3.7 1 versicolor
#> 3 Fold01 6.7 3.3 5.7 2.1 virginica
#> 4 Fold02 4.9 3.6 1.4 0.1 setosa
#> 5 Fold02 5.8 2.7 3.9 1.2 versicolor
#> 6 Fold02 6.4 3.1 5.5 1.8 virginica
#> 7 Fold03 5.1 3.5 1.4 0.3 setosa
#> 8 Fold03 6.8 2.8 4.8 1.4 versicolor
#> 9 Fold03 7.2 3 5.8 1.6 virginica
#> 10 Fold04 5.1 3.3 1.7 0.5 setosa
#> # ... with 140 more rows
# Extract the row IDs that each assessment set corresponds to
ex$row_ids <- map(ex$splits, ~setdiff(1:150, .x$in_id))
# Still 150 rows, confirms that the V fold cross validation
# was without replacement
ex %>%
unnest(resamples, row_ids) %>%
distinct(row_ids)
#> # A tibble: 150 x 1
#> row_ids
#> <int>
#> 1 26
#> 2 82
#> 3 125
#> 4 38
#> 5 83
#> 6 138
#> 7 18
#> 8 77
#> 9 130
#> 10 24
#> # ... with 140 more rows
Created on 2018-08-15 by the reprex package (v0.2.0).
2 Likes