# Resampling Groups of Data?

I am working with the R programming language.

I have the following data on a set of students repeatedly flipping a (potentially correlated/biased) coin:

```r
library(tidyverse)

set.seed(123)
ids <- 1:100
student_id <- sort(sample(ids, 100000, replace = TRUE))

# One flip per row, so coin_result must be as long as student_id
coin_result <- character(length(student_id))
coin_result[1] <- sample(c("H", "T"), 1)

# A student's first flip is fair; each later flip repeats the
# previous result with probability 0.6
for (i in 2:length(coin_result)) {
  if (student_id[i] != student_id[i - 1]) {
    coin_result[i] <- sample(c("H", "T"), 1)
  } else if (coin_result[i - 1] == "H") {
    coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.6, 0.4))
  } else {
    coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.4, 0.6))
  }
}

# tidy up
my_data <- data.frame(student_id, coin_result)
my_data <- my_data[order(my_data$student_id), ]

# number the flips within each student
final <- my_data %>%
  group_by(student_id) %>%
  mutate(flip_number = row_number())
```

My Question: Using this data, I want to perform the following procedure:

• Step 1: Randomly sample (with replacement) 100 student IDs from `final` (e.g. resample_id_1 = student_1, resample_id_2 = student_54, resample_id_3 = student_23, resample_id_4 = student_54, etc.)
• Step 2: For each resample_id, select all rows of data for that student. If a student is drawn multiple times, that student's data will also appear multiple times.
• Step 3: For each resample_id, count the number of HH, HT, TH, and TT transitions between consecutive flips, making sure not to count a transition between the last row of resample_id_n and the first row of resample_id_n+1. Store these results.
• Step 4: Repeat Steps 1 to 3 many times.
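
To make the steps concrete, here is a minimal base R sketch on a toy data set. It is not the attempt shown further down, and it swaps in a shortcut: because transitions are only counted within a student, the counts for one resample should equal the sum of the per-student counts of the drawn IDs, so the per-student counts can be computed once. The names `toy`, `levels_tr`, `count_transitions` and `resample_once` are made up for this illustration, and the toy values are hypothetical.

```r
# Toy stand-in for `final`: 3 students with a few flips each (hypothetical values)
toy <- data.frame(
  student_id  = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  coin_result = c("H", "H", "T", "T", "H", "T", "T", "T", "H")
)

levels_tr <- c("HH", "HT", "TH", "TT")

# Step 3 for a single student: pair each flip with the next one and count the pairs
count_transitions <- function(flips) {
  pairs <- paste0(head(flips, -1), tail(flips, -1))
  tabulate(match(pairs, levels_tr), nbins = length(levels_tr))
}

# Per-student transition counts, computed once (rows = students, columns = HH/HT/TH/TT)
flips_by_student <- split(toy$coin_result, toy$student_id)
per_student <- do.call(rbind, lapply(flips_by_student, count_transitions))
colnames(per_student) <- levels_tr

# Steps 1 to 3 for one resample: draw IDs with replacement and add up their counts.
# A student drawn twice contributes its row twice; nothing crosses student boundaries.
resample_once <- function(per_student) {
  ids <- rownames(per_student)
  drawn <- sample(ids, length(ids), replace = TRUE)
  colSums(per_student[drawn, , drop = FALSE])
}

# Step 4: repeat many times (5 here to keep the example small)
set.seed(123)
boot <- t(replicate(5, resample_once(per_student)))
boot
```

Each row of `boot` holds the HH, HT, TH and TT counts for one resample. On the real data, `toy` would be replaced by `final`, and 5 by the desired number of repetitions.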

This is my attempt at doing this:

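```r
set.seed(123)
ids <- 1:100

library(dplyr)
library(stringr)  # for str_c()

results <- list()

for (j in 1:100) {
  # Step 1: draw 100 student IDs with replacement
  selected_ids <- sample(ids, 100, replace = TRUE)

  resampled_data <- data.frame()

  # Step 2: stack each drawn student's rows, tagged with the draw index
  for (i in seq_along(selected_ids)) {
    current_id <- selected_ids[i]
    current_data <- final %>% filter(student_id == current_id)
    current_data$resampled_id <- i
    resampled_data <- rbind(resampled_data, current_data)
  }

  # Step 3: pair each flip with the next flip inside the same resampled_id.
  # Note: in dplyr >= 1.1.0, reframe() is the recommended replacement for a
  # summarize() call that returns several rows per group.
  current_result <- resampled_data %>%
    group_by(resampled_id) %>%
    summarize(Sequence = str_c(coin_result, lead(coin_result)), .groups = 'drop') %>%
    filter(!is.na(Sequence)) %>%
    count(Sequence)

  results[[j]] <- current_result
}
```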

My Question: Apart from this taking a long time to run, I am not sure whether I am doing this correctly. I am worried that, within a given iteration, if the same student appears 3 times in the resampled dataset, the last flip of the first copy will "leak" into the first transition of the second copy and thus compromise the results.
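
One way to test this concern in isolation is a tiny hand-made case where the same student is drawn twice. The sketch below uses hypothetical toy values and assumes the copies are told apart by `resampled_id`, as in my attempt. It uses `paste0()` after filtering instead of `str_c()` so that it only needs dplyr.

```r
library(dplyr)

# Hypothetical toy case: student 54 was drawn twice, so the same student's
# rows appear under two different resampled_id values (1 and 2)
toy <- data.frame(
  student_id   = c(54, 54, 54, 54, 54, 54),
  resampled_id = c(1, 1, 1, 2, 2, 2),
  coin_result  = c("H", "T", "T", "H", "H", "T")
)

pairs <- toy %>%
  group_by(resampled_id) %>%
  mutate(next_flip = lead(coin_result)) %>%  # NA on the last row of each copy
  ungroup() %>%
  filter(!is.na(next_flip)) %>%              # drop the rows with no following flip
  mutate(Sequence = paste0(coin_result, next_flip))

count(pairs, Sequence)
```

In this toy data the only T followed by H sits on the boundary between the two copies, so a "TH" pair in the output would indicate a leak. Grouping by `student_id` instead of `resampled_id` would merge the two copies into one group and let that boundary pair through.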

Thanks!

Note: Optional Code to Visualize the Results

```r
final_result <- data.frame(iteration = 1:100, HH = numeric(100), HT = numeric(100), TH = numeric(100), TT = numeric(100))

# convert each iteration's counts into proportions of all transitions
for (i in 1:100) {
  current_result <- results[[i]]
  total_count <- sum(current_result$n)
  final_result$HH[i] <- current_result$n[current_result$Sequence == "HH"] / total_count
  final_result$HT[i] <- current_result$n[current_result$Sequence == "HT"] / total_count
  final_result$TH[i] <- current_result$n[current_result$Sequence == "TH"] / total_count
  final_result$TT[i] <- current_result$n[current_result$Sequence == "TT"] / total_count
}

library(ggplot2)

final_result_long <- final_result %>%
  pivot_longer(cols = c(HH, HT, TH, TT), names_to = "Sequence", values_to = "Probability")

ggplot(final_result_long, aes(x = iteration, y = Probability, color = Sequence)) +
  geom_line()
```
