Systematic sampling from dataframe

Stubb · March 6, 2022, 3:02pm

Hi

I'm trying to draw a systematic sample from a dataframe that's generated by several loops I'm running, meaning the number of rows/observations will vary from iteration to iteration. The dataframe consists of four variables/columns:
"ID" "Width" "Length" "Rank"

The "Rank" ranges from 1 to N. This variable represents the order of my observations and is what I want to sample by. The goal is to select every n'th observation according to rank. The sample size is going to vary with the "for" loop I'm running (samplesize=4,6,8,10).

Example: If the current dataframe consists of 20 observations and I'm in the samplesize=4 part of the "for" loop, "Rank" will be 1,2,3,4,(...),20. The sampling interval will then be 20/4=5, i.e. every 5th observation.
I'd then like to make a new dataframe with every 5th observation (according to "Rank").

Any ideas on how to set something like that up? This might be really easy but I'm a bit stuck.
Thanks for any help you might provide!

StatSteph · March 6, 2022, 8:26pm

OK, if you want to take a random sample of n objects from a population of size N using a systematic sample, you'll need to 1) calculate the sampling interval and then 2) choose a random starting point within the first interval. This is done in the function created below, which is then applied to a data example.

set.seed(12345) # we will all get the same random samples by setting a seed

get_sys_indicator <- function(N, n) {
  k <- floor(N / n)           # sampling interval; floor keeps every index within 1..N
  r <- sample(1:k, 1)         # random starting point in 1..k
  seq(r, r + k * (n - 1), k)  # vector of the n indices to select
}

mydat <- data.frame(
  ID=letters[1:20],
  Width=rlnorm(20),
  Length=rlnorm(20),
  Rank=1:20
)

head(mydat)
#>   ID     Width    Length Rank
#> 1  a 1.7959405 2.1806477    1
#> 2  b 2.0329054 4.2878485    2
#> 3  c 0.8964585 0.5250150    3
#> 4  d 0.6354021 0.2115831    4
#> 5  e 1.8328781 0.2023595    5
#> 6  f 0.1623573 6.0805644    6
(sampleindex <- get_sys_indicator(20, 4))
#> [1]  3  8 13 18
mysample <- mydat[mydat$Rank %in% sampleindex, ]
mysample
#>    ID     Width    Length Rank
#> 3   c 0.8964585 0.5250150    3
#> 8   h 0.7586732 1.8596342    8
#> 13  m 1.4486439 7.7616143   13
#> 18  r 0.7177905 0.1897495   18

Created on 2022-03-06 by the reprex package (v2.0.1)
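
Since the original question loops over several sample sizes and the number of rows changes between iterations, here is a minimal sketch of how the same idea could be wrapped up for that case. The helper name `systematic_sample` and the list `samples` are hypothetical and not part of the reply above. The sketch assumes the interval is computed as `floor(N / n)` so that every index stays within 1..N even when N is not a multiple of n, and it sorts by `Rank` first in case the rows are not already in rank order.

```r
# Hypothetical helper (not from the thread): draw a systematic sample of
# n rows from `dat`, taken in the order given by its Rank column.
systematic_sample <- function(dat, n) {
  N <- nrow(dat)
  stopifnot(n >= 1, n <= N)
  k <- floor(N / n)                       # sampling interval (assumption: floor)
  r <- sample.int(k, 1)                   # random starting point in 1..k
  idx <- seq(r, by = k, length.out = n)   # positions r, r + k, ..., r + k*(n-1)
  ordered <- dat[order(dat$Rank), ]       # make row position match Rank order
  ordered[idx, ]
}

set.seed(1)
mydat <- data.frame(
  ID = letters[1:20],
  Width = rlnorm(20),
  Length = rlnorm(20),
  Rank = 1:20
)

# One systematic sample per sample size, stored in a named list.
sample_sizes <- c(4, 6, 8, 10)
samples <- lapply(sample_sizes, function(n) systematic_sample(mydat, n))
names(samples) <- paste0("n", sample_sizes)

sapply(samples, nrow)  # number of rows drawn for each sample size
```

Each element of `samples` is a dataframe with exactly n rows. One consequence of using `floor` is that when N is not a multiple of n (for example N=20, n=6, giving an interval of 3), the last few ranks can never be selected, so whether that is acceptable depends on the application.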

Stubb · March 6, 2022, 9:29pm

Thank you so much! This was exactly what I was trying to do.
