Shouldn't average k-fold cross validation results be different than not using k-fold?


set.seed(123)
x <- rnorm(100, 2, 1)
# rnorm(5, 0, 2) gives only 5 noise values, which R recycles across the 100 observations
y <- exp(x) + rnorm(5, 0, 2)
plot(x, y)
data <- data.frame(x, y)

# Fit on all of the data and report the in-sample RMSE
model <- lm(y ~ x, data)
cat("Model ", "Intercept", "X1", "RMSE", "\n")
RMSE <- sqrt(mean((data$y - fitted(model))^2))
cat("Default model:", model$coefficients, RMSE, "\n")

# 10-fold cross-validation
n <- nrow(data)
data.shuffled <- data[sample(n), ] # note: never used below; the folds index the unshuffled data
folds <- cut(seq(1, n), breaks=10, labels=FALSE) # ten 1's, ten 2's, etc.
errors <- c(0) # placeholder first element, skipped via errors[2:11] below
coeffs <- data.frame()
for(i in 1:10){
  fold.indexes <- which(folds == i, arr.ind=TRUE)
  test.data <- data[fold.indexes, ]
  training.data <- data[-fold.indexes, ]
  train.linear <- lm(y ~ x, training.data)
  coeffs <- rbind(coeffs, train.linear$coefficients)
  train.output <- predict(train.linear, test.data)
  # RMSE on the held-out fold
  errors <- c(errors, sqrt(sum(((train.output - test.data$y)^2 / length(train.output)))))
}
colnames(coeffs) <- c("Intercept", "X1")
cat("Avg k-fold model:", mean(coeffs$Intercept), mean(coeffs$X1), mean(errors[2:11]) ,"\n")

Model              Intercept  X1        RMSE
Default model:     -13.63225  11.98009  6.444126
Avg k-fold model:  -13.64274  11.98094  6.425408

Think of the central limit theorem (CLT). The population is normally distributed by design. When we take random samples from it, the mean of the samples closely approximates the mean of the population, so near-identical averages are expected. That's not mysterious.
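Here is a minimal sketch of that point, not a definitive rewrite of the code in the question. It departs from the original in two ways, both assumptions on my part: the noise is drawn once per observation with rnorm(n, 0, 2), and the folds are taken from the shuffled rows. The names full.fit, shuffled and cv are only illustrative. Each fold's model is fit on 90% of the same data, so the per-fold coefficients scatter around the full-data fit and their average should land close to it.

```r
# Compare a fit on all the data with fits on the ten 90% training subsets.
set.seed(123)
n <- 100
x <- rnorm(n, 2, 1)
y <- exp(x) + rnorm(n, 0, 2)  # assumption: one noise draw per observation
data <- data.frame(x, y)

# Fit on everything, with the in-sample RMSE
full.fit <- lm(y ~ x, data)
full.rmse <- sqrt(mean(residuals(full.fit)^2))

# Shuffle the rows once, then label the shuffled rows with folds 1..10
shuffled <- data[sample(n), ]
folds <- cut(seq_len(n), breaks = 10, labels = FALSE)

# One row per fold: coefficients from the training part, RMSE on the held-out part
cv <- t(sapply(1:10, function(i) {
  test <- shuffled[folds == i, ]
  train <- shuffled[folds != i, ]
  fit <- lm(y ~ x, train)
  pred <- predict(fit, test)
  c(coef(fit), RMSE = sqrt(mean((pred - test$y)^2)))
}))
colnames(cv) <- c("Intercept", "X1", "RMSE")

print(c(coef(full.fit), RMSE = full.rmse))  # full-data fit
print(colMeans(cv))                         # average over the folds
print(apply(cv, 2, sd))                     # spread between the folds
```

The last line is the one to look at: the individual folds do differ from one another, and it is only the averaging that brings them back to the full-data values.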
