I have been tasked with generating a large dataset for the following request:

"Write a function that takes a size n, then (1) builds a dataset using the code provided in Q1 but with n observations instead of 100 and without the set.seed(1), (2) runs the replicate() loop that you wrote to answer Q1, which builds 100 linear models and returns a vector of RMSEs, and (3) calculates the mean and standard deviation. "

The dataset and my code so far are below:

library(caret)  # createDataPartition()
library(dplyr)  # %>% and slice()

n <- c(100, 500, 1000, 5000, 10000)

Sigma <- 9*matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)

dat <- MASS::mvrnorm(n = 100, c(69, 69), Sigma) %>%
  data.frame() %>% setNames(c("x", "y"))

rmse <- replicate(100, {
  # split the data 50/50 into training and test sets
  test_index <- createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
  train_set <- dat %>% slice(-test_index)
  test_set <- dat %>% slice(test_index)
  # fit on the training set, then compute the RMSE on the test set
  fit <- lm(y ~ x, data = train_set)
  y_hat <- predict(fit, newdata = test_set)
  sqrt(mean((y_hat - test_set$y)^2))
})

The goal is to return the mean and standard deviation of the RMSEs for each value in "n".
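A minimal sketch of the shape such a function could take is shown here. The name `rmse_stats` is hypothetical, and the 50/50 split uses base-R `sample()` as a stand-in for `createDataPartition()` so that the sketch runs without caret; it is an illustration under those assumptions, not the course's reference solution.

```r
# Hypothetical helper: mean and SD of 100 RMSEs for a dataset of n observations.
rmse_stats <- function(n) {
  Sigma <- 9 * matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)
  # Build the dataset inside the function so that it depends on n
  dat <- as.data.frame(MASS::mvrnorm(n = n, c(69, 69), Sigma))
  names(dat) <- c("x", "y")

  rmse <- replicate(100, {
    # Assumption: plain random 50/50 split, standing in for createDataPartition()
    test_index <- sample(nrow(dat), size = floor(nrow(dat) / 2))
    train_set <- dat[-test_index, ]
    test_set <- dat[test_index, ]
    fit <- lm(y ~ x, data = train_set)
    y_hat <- predict(fit, newdata = test_set)
    sqrt(mean((y_hat - test_set$y)^2))
  })

  # Summarise the whole vector of 100 RMSEs
  c(mean = mean(rmse), sd = sd(rmse))
}

n <- c(100, 500, 1000, 5000, 10000)
results <- sapply(n, rmse_stats)  # matrix with rows "mean" and "sd", one column per n
```

The point of the design is that `sapply()` needs a function as its second argument; it calls that function once for each element of `n`.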

So far I have tried passing "n" and the RMSEs to sapply to collect the results:

results <- sapply(n, rmse)

This produces an error: "can't extract residuals from model". However, when I compute the mean manually on a single element selected with the index "[1]":

mean(rmse[1])

I get an incorrect decimal value, and the SD is just NA:

sd(rmse[1])

[1] NA
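For context on the NA: indexing with `[1]` selects a single element, and the sample standard deviation of one value is undefined in R. The short illustration below uses made-up numbers (the vector `rmse_example` is hypothetical, not output from the code above).

```r
# Made-up RMSE values, for illustration only
rmse_example <- c(2.4, 2.6, 2.5, 2.7)

first <- rmse_example[1]    # a length-1 vector
mean_first <- mean(first)   # the mean of one value is that value itself
sd_first <- sd(first)       # NA: sd() needs at least two values

# Summaries over the whole vector are defined
mean_all <- mean(rmse_example)
sd_all <- sd(rmse_example)
```

Calling `mean()` and `sd()` on the full vector, rather than on `rmse[1]`, is what gives a usable summary.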

I might have overlooked some critical factors here. Any pointers, alternative approaches, or tips for solving this part would be highly appreciated.

Thanks,

Irvin