Hi @JohnMount
Thank you for taking the time to provide such a detailed and comprehensive answer. I had a look at the vignette for your package and was wondering, if you have the time, whether you could clarify or confirm my understanding:
- From looking into the package and your code above, the function kWayCrossValidation returns a list of ten splits, each containing a training set and a validation set (stored as row indices).
- The code then calculates a country-level mean within each fold. Taking the first fold of the train and test data as an example, do you calculate the country-level mean on the training set first and then a separate mean within the test set?
- You then iterate through each train and test fold (ten in total) and make the calculation on the train and test sets separately for each pair.
- The final output would be a list of ten items, each with a train and a test dataset, where each pair has its own separately calculated means.
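To make the first point concrete, here is a minimal sketch of how I understand the structure of the cross-validation plan. It uses a toy plan of 20 rows rather than the real data; `toy_plan` and `fold1` are names I made up for illustration, and I am assuming that `dframe` and `y` can be left as NULL for a plain k-way split.

```r
library(vtreat)

set.seed(100)

# Toy plan: 20 rows split into 10 folds (no data frame or outcome supplied)
toy_plan <- kWayCrossValidation(nRows = 20, nSplits = 10, dframe = NULL, y = NULL)

# One list element per fold
length(toy_plan)

# Each fold holds two vectors of row indices: "train" and "app"
names(toy_plan[[1]])

# Within a fold, the train and app indices should not overlap
fold1 <- toy_plan[[1]]
length(intersect(fold1$train, fold1$app))

# Across folds, every row should land in exactly one "app" set
app_rows <- unlist(lapply(toy_plan, function(f) f$app))
table(app_rows)
```

If this reading is right, the plan itself contains no data, only the row positions used to subset the data frame for each fold.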
Below is a reprex I tried to implement based on more realistic data taken from the lavaan package. The data is the HolzingerSwineford1939 dataset and comprises the mental ability test scores of seventh- and eighth-grade children from two different schools.
I have changed it slightly to make an artificial example using your library and returned the results in a data frame so it's easier to see what happened.
library(lavaan)
library(tidyverse)
set.seed(100)
mydf <- HolzingerSwineford1939 %>%
  select(ageyr:x9) %>%
  filter(complete.cases(.))
# We will assume that x1 to x3 is a single attribute on individual level - attrib_1
# Similarly we assume x4 to x6 is a single attribute on individual level - attrib_2
# Finally we assume that x7 to x8 is some measurement on school level - attrib_3
# We are trying to predict x9 - dep
mydf <- mydf %>%
  mutate(attrib_1 = rowMeans(select(., x1:x3)),
         attrib_2 = rowMeans(select(., x4:x6)),
         dep = x9) %>%
  select(ageyr:grade, x7, x8, starts_with('attr'), dep)
# Create a training and test set
library(caret)
index <- createDataPartition(y=mydf$dep, p=0.7, list=FALSE)
train <- mydf[index,]
test <- mydf[-index,]
# Create Cross Validation Plan
cross_plan <- vtreat::kWayCrossValidation(nrow(train), nSplits = 10, train, train$dep)
cross_plan_index <- seq_along(cross_plan)
# Create a function to calculate the mean per school per fold for both train and test
# Function will return a dataframe with the mean per school
create_fold_aggregates <- function(fold, cross_plan, mydf) {
  # We create a test and a training set per fold
  train_fold <- mydf[cross_plan[[fold]][["train"]], ] %>%
    rownames_to_column() %>%
    mutate(data_type = 'train')
  test_fold <- mydf[cross_plan[[fold]][["app"]], ] %>%
    rownames_to_column() %>%
    mutate(data_type = 'test')
  # Combine the train fold and the test fold together
  est_data <- bind_rows(train_fold, test_fold)
  # Generate mean aggregate per school per dataset type
  # So the mean per school will differ in the training and the test
  # It will also differ across folds
  est_data_agg <- est_data %>%
    select(data_type, school, x7, x8) %>%
    mutate(ind_mean = rowMeans(select(., x7, x8))) %>%
    group_by(data_type, school) %>%
    summarise(school_mean = mean(ind_mean)) %>%
    ungroup()
  # Bring individual scores and group scores together
  # (join keys stated explicitly rather than relying on the default)
  est_data <- est_data %>%
    left_join(est_data_agg, by = c("data_type", "school")) %>%
    select(-x7, -x8) %>%
    rename(index = rowname) %>%
    mutate(fold = paste0("fold", fold))
  return(est_data)
}
# Review the new aggregates per fold for train and test
# (stored under a new name so the held-out `test` set above is not overwritten)
fold_aggregates <- map_df(.x = cross_plan_index, .f = create_fold_aggregates, cross_plan, train)
# Plot the mean per school per fold for both test and train just to get an understanding of how the means change
ggplot(fold_aggregates, aes(x = fold, y = school_mean, colour = school, shape = data_type)) +
  geom_point(size = 4) +
  theme_minimal()
I notice in the vignette you use cross validation when implementing a logistic regression. If I wanted to use the above with, say, the structure of caret to try many different models to predict the variable 'dep', is this possible? I would like to take advantage of caret's diagnostic tools for model performance, as well as use the same structure to produce other models, but I wouldn't have a clue how to force caret to use our newly calculated train/test data with custom folds.
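The closest thing I have found so far is the `index` and `indexOut` arguments of `caret::trainControl`, which accept lists of row indices for the training and hold-out portion of each resample. Below is a minimal sketch on toy data of how the vtreat plan might be passed across. The `toy` data frame and the names `train_idx` and `test_idx` are placeholders of mine, and I have not checked this against the real dataset.

```r
library(caret)
library(vtreat)

set.seed(100)

# Toy data: one predictor and a noisy linear outcome
toy <- data.frame(x = rnorm(100))
toy$dep <- 2 * toy$x + rnorm(100)

# Same style of plan as above: 10 folds of train/app row indices
plan <- kWayCrossValidation(nrow(toy), 10, toy, toy$dep)

# Reshape the plan into the two named lists of row indices
train_idx <- lapply(plan, function(f) f$train)
test_idx  <- lapply(plan, function(f) f$app)
names(train_idx) <- paste0("fold", seq_along(plan))
names(test_idx)  <- names(train_idx)

# Ask caret to resample on exactly these folds
ctrl <- trainControl(method = "cv", index = train_idx, indexOut = test_idx)
fit  <- train(dep ~ x, data = toy, method = "lm", trControl = ctrl)

# Per-fold performance metrics
fit$resample
```

As far as I can tell, this only makes caret reuse the same row indices for each fold. It would not recompute the school-level means separately within each train and test fold, which is the part I am unsure how to achieve.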
Thanks again for your time. I appreciate this is a bit long-winded.