slice_sample() behaving weird on my data

LauradJ · November 29, 2021, 6:04pm

Hey,
I'm wondering why slice_sample() is acting weird on my data. It seems to convert certain count variables into decimal numbers. The variables are numeric (not integer), but slice_sample() doesn't do the same with mtcars, which also has numeric variables. I can't give a full reprex, I'm afraid, because mtcars doesn't show the problem. I know I can probably fix it by converting my data to integers; I'm just surprised and wondering why it behaves like this.

> mtcars %>% slice_sample(prop = 0.1)
               mpg cyl  disp  hp drat    wt qsec vs am gear carb
Lotus Europa  30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ferrari Dino  19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
> str(mtcars)
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...


> data_clean %>%
+   slice_sample(., prop = 0.1)
   match_id sequence_id team_id_both duration_tot_seq event_start unique_plyr_both setplay team_id_team_in_pos
1  907804.5   234.32463    12477.513         5.000000   13.000000         2.837684       1           12477.513
2  907802.2   218.97067    11482.000         7.058666   11.705777         3.000000       1           11482.000
3  908883.2    98.25214    12477.000         2.565170   14.252136         2.313034       1           12477.000
4  908919.0   205.00000     9919.000         9.000000   13.000000         3.000000       1            9919.000
5  908849.0   216.00000    12478.000         8.000000    3.000000         4.000000       1           12478.000
6  908939.0   275.00000    12479.000        21.000000    3.000000         5.000000       1           12479.000

> str(data_clean)
'data.frame':	178784 obs. of  47 variables:
 $ match_id               : num  907784 907784 907784 907784 907784 ...
 $ sequence_id            : num  2 3 4 5 6 7 8 12 13 14 ...
 $ team_id_both           : num  12474 12479 12474 12479 12474 ...
 $ duration_tot_seq       : num  7 23 4 6 4 2 4 16 22 12 ...
 $ event_start            : num  17 10 3 13 3 13 13 3 6 13 ...
 $ unique_plyr_both       : num  2 4 2 2 1 3 2 5 4 6 ...
 $ setplay                : num  1 1 1 1 1 1 1 1 1 1 ...
 $ team_id_team_in_pos    : num  12474 12479 12474 12479 12474 ...

session info:

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_Netherlands.1252  LC_CTYPE=English_Netherlands.1252    LC_MONETARY=English_Netherlands.1252
[4] LC_NUMERIC=C                         LC_TIME=English_Netherlands.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] factoextra_1.0.7 cluster_2.1.2    forcats_0.5.1    stringr_1.4.0    dplyr_1.0.7      purrr_0.3.4     
 [7] readr_2.0.1      tidyr_1.1.3      tibble_3.1.3     ggplot2_3.3.5    tidyverse_1.3.1 

loaded via a namespace (and not attached):
 [1] bslib_0.2.5.1     tidyselect_1.1.1  xfun_0.25         haven_2.4.3       colorspace_2.0-2  vctrs_0.3.8      
 [7] generics_0.1.0    htmltools_0.5.1.1 yaml_2.2.1        utf8_1.2.2        rlang_0.4.11      jquerylib_0.1.4  
[13] pillar_1.6.2      glue_1.4.2        withr_2.4.2       DBI_1.1.1         dbplyr_2.1.1      modelr_0.1.8     
[19] readxl_1.3.1      audio_0.1-8       lifecycle_1.0.0   munsell_0.5.0     gtable_0.3.0      cellranger_1.1.0 
[25] rvest_1.0.1       evaluate_0.14     knitr_1.33        tzdb_0.1.2        fansi_0.5.0       broom_0.7.9      
[31] Rcpp_1.0.7        scales_1.1.1      backports_1.2.1   jsonlite_1.7.2    fs_1.5.0          digest_0.6.27    
[37] hms_1.1.0         stringi_1.7.3     ggrepel_0.9.1     grid_4.1.1        cli_3.0.1         tools_4.1.1      
[43] sass_0.4.0        magrittr_2.0.1    beepr_1.3         crayon_1.4.1      pkgconfig_2.0.3   ellipsis_0.3.2   
[49] xml2_1.3.2        reprex_2.0.1      lubridate_1.7.10  rmarkdown_2.11    assertthat_0.2.1  httr_1.4.2       
[55] rstudioapi_0.13   R6_2.5.1          compiler_4.1.1

HanOostdijk · November 29, 2021, 11:06pm

Hello @LauradJ ,

This should not be the case. Looking at the code of dplyr:::slice_sample.data.frame, one sees that slice() is called with indices generated by dplyr:::sample_int, which is a variant of sample.int.
So there is no calculation of rows, only the selection of certain of them.

This leads me to think that the non-integer numbers (where you expect integer numbers) are already in data_clean before the slice_sample step. In other words: I expect that data_clean does indeed have a row with a match_id of (around) 907804.5 and a sequence_id of about 234.32463.
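
The point that slice_sample() only selects rows, and does not compute new values, can be checked on toy data. This is a minimal sketch: the data frame toy, its columns, and the seed are made up for illustration and are not from the original post.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical toy data: whole-number counts stored as doubles, as in mtcars.
toy <- data.frame(
  id    = 1:50,
  count = as.numeric(sample(0:9, 50, replace = TRUE))
)

sampled <- slice_sample(toy, prop = 0.1)

# Each sampled row should match the original row with the same id exactly.
values_unchanged <- identical(sampled$count, toy$count[sampled$id])

# No fractional values should appear, because none were in toy.
all_whole <- all(sampled$count == round(sampled$count))
```

If both checks come out TRUE on your own data as well, the decimals were already present before sampling.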

Be aware that the figures shown can be rounded.
A way to check my assumption is to sort on match_id and filter on a range around 907804.5.
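
A rough sketch of that check, assuming a small made-up data frame (toy) in place of data_clean; only the column names and the two fractional values are taken from the output above, the rest is invented for illustration:

```r
library(dplyr)

# Hypothetical stand-in for data_clean; the second row holds fractional values.
toy <- data.frame(
  match_id    = c(907784, 907804.5, 908919),
  sequence_id = c(2, 234.32463, 205)
)

# Rows whose match_id is not a whole number.
fractional <- toy %>%
  filter(match_id != round(match_id))

# The range filter described above, after sorting on match_id.
nearby <- toy %>%
  arrange(match_id) %>%
  filter(match_id > 907804, match_id < 907805)

# Print with more digits so rounding in the display cannot hide decimals.
print(fractional, digits = 10)
```

Running the same two filters on data_clean itself should show whether such rows exist before slice_sample() is ever called.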

nirgrahamuk · November 30, 2021, 9:01am

My fellow poster has given you excellent insight and advice.
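In addition to that, I should note that

data_clean %>% slice_sample(., prop = 0.1)

is equivalent to

slice_sample(data_clean, prop = 0.1)

because the pipe does not insert the left-hand side as the first argument a second time when the dot is used as a direct argument. The explicit dot is therefore redundant, and data_clean %>% slice_sample(prop = 0.1) does the same thing.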
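
To see how the pipe treats the dot in a call like this, one can count the arguments a function actually receives. This is only a sketch: count_args is a hypothetical helper written for this illustration, and the seed is arbitrary.

```r
library(dplyr)  # also makes the magrittr pipe %>% available

# Hypothetical helper that reports how many arguments it received.
count_args <- function(...) length(list(...))

# With an explicit dot as a direct argument, and without it.
n_with_dot    <- mtcars %>% count_args(., 0.1)
n_without_dot <- mtcars %>% count_args(0.1)

# The same comparison with slice_sample(), using a fixed seed for both calls.
set.seed(1)
piped <- mtcars %>% slice_sample(., prop = 0.1)
set.seed(1)
direct <- slice_sample(mtcars, prop = 0.1)
same_result <- identical(piped, direct)
```

Both counts should come out equal, which indicates that the data frame is passed only once in either form.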
