Manually Creating the Cross Validation Process in R

omario · October 8, 2021, 2:05am

I am working with the R programming language. In this problem, I want to check how well the 80th percentile (for each group) within a training sample generalizes to unseen test data. Basically, I am trying to manually re-create the "k-fold cross validation" (Cross-validation (statistics) - Wikipedia procedure where instead of a classical predictive model, the model in this case is just the 80th percentile.

I attempted to write some R code that corresponds to this procedure (note: assume that "my_data" is the true population):

#create data

library(dplyr)

set.seed(123)



salary <- rnorm(1000,5,5)
height <- rnorm(1000,2,2)
my_data = data.frame(salary, height)
plot(my_data$salary, my_data$height)


results <- list()
for (i in 1:100) {
  train_i<-sample_frac(my_data, 0.7)
  sid<-as.numeric(rownames(train_i))
  test_i<-my_data[-sid,]
 
  salary_quantiles = data.frame( train_i %>% summarise (quant_1 = quantile(salary, 0.33),
                                                        quant_2 = quantile(salary, 0.66),
                                                        quant_3 = quantile(salary, 0.99)))
 
  train_i$salary_type = as.factor(ifelse(train_i$salary < salary_quantiles$quant_1 , "A", ifelse( train_i$salary  >  salary_quantiles$quant_1  & train_i$salary < salary_quantiles$quant_2, "B", "C")))
 
  height_quantiles = data.frame( train_i %>%  group_by(salary_type)  %>%  summarise(quant_80 = quantile(height, 0.80)))
  test_i$salary_type = as.factor(ifelse(test_i$salary < salary_quantiles$quant_1 , "A", ifelse( test_i$salary  >  salary_quantiles$quant_1  & test_i$salary < salary_quantiles$quant_2, "B", "C")))
  test_i$height_pred <- height_quantiles$quant_80[match(test_i$salary_type, height_quantiles$salary_type)]
  test_i$accuracy = ifelse(test_i$height_pred > test_i$height, 1, 0)
 

 
  results_tmp = data.frame(test_i %>%
                           group_by(salary_type) %>%
                           dplyr::summarize(Mean = mean(accuracy, na.rm=TRUE)))
 
  results_tmp$iteration = i
 
  results_tmp$total_mean = mean(test_i$accuracy)
  results[[i]] <- results_tmp
}
results

#view results
results_df <- do.call(rbind.data.frame, results)

#overall performance
mean(results_df$total_mean)

0.76533

The above R code appears to work. In this example, it would appear that the 76.5 % of time, unseen observations are below the 80% percentile mark generated from the training data. Thus, I can re-do this analysis on the whole data :

 salary_quantiles = data.frame( my_data %>% summarise (quant_1 = quantile(salary, 0.33),
                                                        quant_2 = quantile(salary, 0.66),
                                                        quant_3 = quantile(salary, 0.99)))

salary_quantiles
   quant_1  quant_2  quant_3
1 3.005188 6.952076 16.98823



      my_data$salary_type = as.factor(ifelse(my_data$salary < salary_quantiles$quant_1 , "A", ifelse( my_data$salary  >  salary_quantiles$quant_1  & my_data$salary < salary_quantiles$quant_2, "B", "C")))
      
      height_quantiles = data.frame( my_data %>%  group_by(salary_type)  %>%  summarise(quant_80 = quantile(height, 0.80)))

> height_quantiles
  salary_type quant_80
1           A 3.743133
2           B 3.773208
3           C 3.906956

Now, I can reasonably conclude that if:

Salary between (0, 3.005188) then approximately 80% of heights will be below 3.743 (for unseen data similar to "my_data")
Salary between (3.005188, 6.9520760) then approximately 80% of heights will be below 3.773208 (for unseen data similar to "my_data")
Salary between (6.952076 16.98823) then approximately 80% of heights will be below 3.906956 (for unseen data similar to "my_data")

Question : Can someone please tell me if the R code I have indeed corresponds to the "K Fold Cross Validation" process ? Is this correct?

system · October 29, 2021, 2:05am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.