I am trying to conduct permutation test by year. In the following example, I only did it for a particular year. Rather than manually conducting the test for each year, how could I could I implement it for all the available years? The goal is to have a data frame with year, number of male obs, number of female obs, observed difference, and p-values.
suppressWarnings(suppressMessages(library(tidyverse)))
################
# Complete Data
################
df <- tibble(
delay = c(320, 480, 290, 350, 180, 260, 220, 325, 485, 295, 355, 185, 265, 225),
gender = rep(c("F", "M", "F", "M"), c(4,3,5,2)),
year = rep(c(2020, 2021), each = 7)
)
#########################################################################
# PERMUTATION TEST FOR 2020
#########################################################################
# H0: mu_female = mu_male
# HA: mu_female > mu_male
# Data 2020
df2020 <- df %>%
filter(year == 2020)
# Step 1:
# A: Pull all delays
delay_all_2020 <- df2020 %>% pull(delay)
length(delay_all_2020)
#> [1] 7
# B: Female Delay
delay_female_2020 <- df2020 %>%
filter(gender == "F") %>%
pull(delay)
length(delay_female_2020)
#> [1] 4
# C: Male Delay
delay_male_2020 <- df2020 %>%
filter(gender == "M") %>%
pull(delay)
length(delay_male_2020)
#> [1] 3
# Step 2: Find Mean Difference (Female - Male)
observed_2020 <- mean(delay_female_2020) - mean(delay_male_2020)
observed_2020
#> [1] 140
# Step 3: Simulate
N <- 10^4-1 #set number of times to repeat this process
set.seed(99)
result_2020 <- numeric(N) # space to save the random differences
for(i in 1:N)
{
index_2020 <- sample(length(delay_all_2020),
size = length(delay_female_2020),
replace = FALSE)
result_2020[i] <- mean(delay_all_2020[index_2020]) - mean(delay_all_2020[-index_2020])
}
# Step 4: Plot
ggplot() +
geom_histogram(aes(result_2020)) +
geom_vline(xintercept=observed_2020, colour="red")+
xlab("Female - Male") +
ggtitle("2020")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Step 5: P-value ; H0: mu_female = mu_male, HA: mu_female > mu_male
(sum(result_2020 >= observed_2020) + 1)/(N + 1) #P-value
#> [1] 0.029
Created on 2020-11-11 by the reprex package (v0.3.0)