k-fold cross validation for prediction

Hi. I am trying to understand k-fold cross validation for prediction. I took this example from Intro to Machine Learning by Burger, p. 68, and then modified it. Having chosen the best of the k=10 models based on some criterion, do I then want to create a final test dataset from the original dataset to predict on and measure performance, or am I missing something? (Note that this code builds the k-fold cross validation process from scratch rather than using a package like caret, so I could pause and understand the intermediate steps.) Thank you.

set.seed(123)
x <- rnorm(100,2,1) # n=100
y <- exp(x) + rnorm(5,0,2) # note: this length-5 noise vector is recycled across all 100 observations
data <- data.frame(x,y)
data.shuffled <- data[sample(nrow(data)), ] # shuffled copy (not used below; the loop indexes into data directly)
folds <- cut(seq(1, nrow(data)), breaks=10, labels=FALSE) # vector: 1,1,...,1,2,2,...,2,...,10,10,...,10
errors <- 0 # dummy first element; the 10 fold errors end up in errors[2:11]
coeff_df <- data.frame() # will hold intercept and slope for each fold
for (i in 1:10){
  fold.indexes <- which(folds == i, arr.ind = TRUE) # rows 1...10; 11...20; etc.
  test.data <- data[fold.indexes, ]   # 10 rows held out for testing
  train.data <- data[-fold.indexes, ] # 90 rows for training
  train.linear <- lm(y ~ x, train.data)
  coeff_df[i,1] <- summary(train.linear)$coefficients[1,1] # intercept
  coeff_df[i,2] <- summary(train.linear)$coefficients[2,1] # slope
  train.output <- predict(train.linear, test.data) # 10 predictions
  errors <- c(errors, sqrt(sum((train.output - test.data$y)^2 / length(train.output)))) # fold RMSE
}

errors[2:11]
m <- min(errors[2:11])
w <- which(errors[2:11] == m)
cat("y = ", round(coeff_df[w,1],4), " + ", round(coeff_df[w,2],4), " * x")

Output is y = -13.48 + 12.0834 * x

I think there is a problem in your reasoning: the folds of k-fold CV are not meant to generate different models to choose from, but to evaluate a single model specification on several different splits of the data.
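
As a rough sketch of that idea, reusing your data and folds variables: average the per-fold errors to get one performance estimate for the y ~ x model, rather than picking the coefficients from the fold that happened to have the lowest error.

cv.rmse <- numeric(10)
for (i in 1:10){
  fold.indexes <- which(folds == i)
  fit <- lm(y ~ x, data[-fold.indexes, ])    # fit on the other 9 folds
  pred <- predict(fit, data[fold.indexes, ]) # predict the held-out fold
  cv.rmse[i] <- sqrt(mean((pred - data$y[fold.indexes])^2))
}
mean(cv.rmse) # one cross-validated RMSE estimate for the y ~ x model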

Say we have data and we're not sure whether we should use linear regression or logistic regression (yes, it's a made-up example). We can make a train/test split, run both linear and logistic regression on the training set, and check on the test set whether they give good results. We might find that linear regression gives good results while logistic regression doesn't, and conclude that, for our data, linear regression is more appropriate.
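
For instance, with your simulated data (logistic regression doesn't apply since y is continuous, so purely for illustration I'll use y ~ exp(x) as the second candidate), a single train/test split comparison might look roughly like this:

set.seed(42)                                   # arbitrary seed for the split
train.idx <- sample(nrow(data), 0.8 * nrow(data))
train <- data[train.idx, ]
test  <- data[-train.idx, ]
m1 <- lm(y ~ x, train)                         # candidate 1
m2 <- lm(y ~ exp(x), train)                    # candidate 2 (illustrative only)
rmse <- function(m) sqrt(mean((predict(m, test) - test$y)^2))
c(linear = rmse(m1), with.exp.term = rmse(m2)) # compare test-set RMSE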

But once that's done, it doesn't tell you that the individual parameters of the linear regression you fitted on your training set are the best for any data; it just tells you they were the best for the particular data you had, with the particular train/test split you did.

At this point, I don't see any problem in recombining the training and test sets and rerunning a linear regression on the whole dataset, now that you know that linear regression is the most appropriate model. The split was only needed during model selection.
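
In code, that last step is just refitting the chosen specification on everything:

final.model <- lm(y ~ x, data) # refit on all 100 observations
coef(final.model)              # the coefficients you would actually report and use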

There is just one case where this is inappropriate: if you want to use the amount of error on the test set as a prediction of the error on future data. If the test set was not used at all during model selection and parameter estimation, then it reflects the error of the model on unseen data. So, to get the best of both worlds, it's common to do three splits, into a training, validation, and test set. Then you use the training set to train models, the validation set to compare them and select the best model, and the test set to estimate the amount of error of this model on future data.
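
A minimal sketch of such a three-way split (the 60/20/20 proportions are just a common convention, not a rule):

set.seed(1)
idx <- sample(nrow(data))        # shuffle the row indices
train.set <- data[idx[1:60], ]   # 60%: fit candidate models here
valid.set <- data[idx[61:80], ]  # 20%: compare models and pick the best here
test.set  <- data[idx[81:100], ] # 20%: touch only once, to estimate error on future data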

Cross-validation is just a way to redo the train/validation split several times, so that when you compare models you can see whether one is systematically better than another, or whether one is only slightly better in a way that depends on the particular split.
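
Concretely, you can compute the per-fold error of each candidate and look at the paired differences; again using y ~ exp(x) only as an illustrative second candidate, something like:

rmse.lin <- rmse.alt <- numeric(10)
for (i in 1:10){
  idx <- which(folds == i)
  rmse.lin[i] <- sqrt(mean((predict(lm(y ~ x, data[-idx, ]), data[idx, ]) - data$y[idx])^2))
  rmse.alt[i] <- sqrt(mean((predict(lm(y ~ exp(x), data[-idx, ]), data[idx, ]) - data$y[idx])^2))
}
summary(rmse.lin - rmse.alt) # consistently positive differences => the second candidate wins on most folds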

One additional point: I'm not mentioning parameter estimation here, because you are using linear regression, which already guarantees it will find the best parameters (for some definition of best, namely minimizing the squared error). Similarly, logistic regression promises to give the best parameters (for a different definition of best). So if you're using CV to compare linear and logistic regression, you're asking "which definition of best gives me the better prediction of future values?" There are also models with so-called hyperparameters, where you have to choose a value yourself; in that case you can use CV to try out different values.
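
For example, choosing a polynomial degree by CV (with the degree playing the role of a hyperparameter) could be sketched like this, reusing your folds:

degrees <- 1:4
cv.by.degree <- sapply(degrees, function(d) {
  mean(sapply(1:10, function(i) {
    idx <- which(folds == i)
    fit <- lm(y ~ poly(x, d), data[-idx, ])                 # fit on 9 folds
    sqrt(mean((predict(fit, data[idx, ]) - data$y[idx])^2)) # RMSE on the held-out fold
  }))
})
degrees[which.min(cv.by.degree)] # degree with the lowest cross-validated RMSE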

As for your code, it looks correct. One thing: as defined, your folds are assigned in the same order as the rows of the data. That's fine here since x is random, but in real data chances are the rows are ordered, so you'd want to randomize the fold assignment:

folds <- sample(cut(seq(1, nrow(data)), breaks=10, labels=FALSE))
