How to create 1,024 Linear Regression models using a For Loop and extract Coefficients, calculate scores?

Hi all,

My problem is for the mtcars data set in R, I need to create all possible additive linear regression models where I'm regressing on the mpg variable. The null model is easy, as there's

10 choose 0 ways to get the null model,
and 10 choose 1 ways to create a SLR on mpg;
10 choose 2 ways to create a two variable regression on mpg;
10 choose 3 ways to create a SLR on mpg; etc.,

So in total, as this is equivalent to summing across the 10th row in Pascal's Triangle, the total models I need to consider comes out to be 1,024.

Now, the other tricky part is I need to somehow store each model in some separate object so that all the 2 variable models are grouped together, all the three variable models are grouped together, etc, on top of also storing all them together (though perhaps there's a more efficient way to do this). The reason for this is my task is to look at all of these models, take their AIC scores and their Mallow's Cp scores, store those in a data frame, and then sort those scores from lowest to highest and keep the top 10. On top of this, I need to also be able to store, see, and have access to/use the two best 1-variable models through the two best 10-variable models because I need to provide the r-squared values and adjusted r-squared values for these various models along with the error mean square value. I'm still pretty/relatively new to R/coding in general, but I provide my attempt below:

library(rje)   # provides the powerSet function
library(olsrr) # provides the ols_mallows_cp function to calculate the Mallow's Cp values

mtcars <- datasets::mtcars

x <- powerSet(colnames(mtcars[,-1]))

datalist <- list()
for(i in c(2:1024)){
  datalist[[i]] <- mtcars[,colnames(mtcars) %in% c("mpg",x[[i]]) ]
}

full_model <- lm(mpg ~ ., data = mtcars)
Cp_vec <- c()

for (i in c(2:1024)){
  model <- lm(mpg ~ ., data = datalist[[i]])
  Cp_vec[i] <- ols_mallows_cp(model, full_model)
}

names(Cp_vec) <- as.character(c(1:1024)) 
TenSmallestCp <- Cp_vec[cpvec %in% head(sort(Cp_vec),10)]
Small_List <- list()

for (i in 1:10){
  Small_List[[i]] <- x[[as.numeric(names(TenSmallestCp))[i]]]
}

Small_List[[1]]
Small_List[[2]]
Small_List[[3]]
Small_List[[4]]
Small_List[[5]]
Small_List[[6]]
Small_List[[7]]
Small_List[[8]]
Small_List[[9]]
Small_List[[10]]

The way I currently have it produces this as its output:

[1] "cyl" "wt" 
[1] "hp" "wt"
[1] "cyl" "hp"  "wt" 
[1] "cyl"  "wt"   "qsec"
[1] "hp" "wt" "am"
[1] "wt"   "qsec" "am"  
[1] "disp" "wt"   "qsec" "am"  
[1] "hp"   "wt"   "qsec" "am"  
[1] "cyl"  "wt"   "carb"
[1] "wt"   "qsec" "am"   "carb"

So this tells me what the 10 best models are with regards to the Mallow's Cp scores, but perhaps it's just because I've been staring at this problem for way too long, but I can't figure out how to actually save the linear model and have access to it, say, if I wanted to plot it or something. I know I could just easily recreate it with my output, but I'm also trying to become efficient with my coding and not always resort to hard coding things, you know? I also cannot figure out how to store the models based on the number of variables that are included in the model so I can access the top two models from each.

Before posting this, I checked out these links:

  1. https://stackoverflow.com/questions/27952653/how-to-loop-repeat-a-linear-regression-in-r

  2. https://stackoverflow.com/questions/66152702/regression-with-for-loop-with-changing-variables?noredirect=1&lq=1

  3. https://stackoverflow.com/questions/46493011/r-loop-for-variable-names-to-run-linear-regression-model

I admit that because I'm new, the answer to my problem(s) might fully exist in some linear combination of these three answers, and I'm just having trouble seeing it and putting it together, but while I think the first link I shared does have a lot that's relevant to my problem, and the last one also is pretty related, I'm not sure how the second one is much help. That's why I'm posting this as a new question.

Thanks for taking the time to read this lengthy post and consider helping me with my problem here!

Hi @Radon_my_Nikodym196,

I always try to avoid for loops unless absolutely necessary. I think we can avoid it here by using a map/lapply approach, and it actually helps to keep us organized. I'm going to keep everything (formulas, models, AICs) nested inside a single data frame. This is preferable because data frames are easy to manipulate with existing tools (e.g. dplyr), and also because it saves us from needing to store intermediate results in many different objects.

First, I find all of the possible models using the combinat::combn() function. I use this to create each of the 1,024 formulas. Then we can place this all nicely into a data frame (actually, a tibble), and use purrr::map() to loop over each formula and fit the linear model. Once models are fit, we can extract the AIC (or other such statistics) and use our friendly dplyr tools to find the "best" models.

This should be a good starting point. Let me know if you have any questions.

library(tidyverse)
library(glue)
library(combinat)

xvars <- names(mtcars[, -1])

# Find all possible models
data <- 
  map_dfr(1:length(xvars), function(m) {
  combn(xvars, m, fun = glue_collapse, sep = " + ") %>% 
    enframe(NULL, "formula")
}, .id = "nvars") %>% 
  bind_rows(
    tibble(nvars = "0", formula = "1"),
    .
  ) %>% 
  mutate(nvars = as.integer(nvars))

head(data)
#> # A tibble: 6 x 2
#>   nvars formula
#>   <int> <chr>  
#> 1     0 1      
#> 2     1 cyl    
#> 3     1 disp   
#> 4     1 hp     
#> 5     1 drat   
#> 6     1 wt
tail(data)
#> # A tibble: 6 x 2
#>   nvars formula                                                   
#>   <int> <chr>                                                     
#> 1     9 cyl + disp + hp + drat + qsec + vs + am + gear + carb     
#> 2     9 cyl + disp + hp + wt + qsec + vs + am + gear + carb       
#> 3     9 cyl + disp + drat + wt + qsec + vs + am + gear + carb     
#> 4     9 cyl + hp + drat + wt + qsec + vs + am + gear + carb       
#> 5     9 disp + hp + drat + wt + qsec + vs + am + gear + carb      
#> 6    10 cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb

# Fit models, extract AIC
all_models <-
  data %>% 
  mutate(
    formula = glue("mpg ~ {formula}"),
    model = map(formula, lm, data = mtcars),
    aic = map_dbl(model, AIC)
  )

all_models
#> # A tibble: 1,024 x 4
#>    nvars formula    model    aic
#>    <int> <glue>     <list> <dbl>
#>  1     0 mpg ~ 1    <lm>    209.
#>  2     1 mpg ~ cyl  <lm>    169.
#>  3     1 mpg ~ disp <lm>    170.
#>  4     1 mpg ~ hp   <lm>    181.
#>  5     1 mpg ~ drat <lm>    191.
#>  6     1 mpg ~ wt   <lm>    166.
#>  7     1 mpg ~ qsec <lm>    205.
#>  8     1 mpg ~ vs   <lm>    192.
#>  9     1 mpg ~ am   <lm>    196.
#> 10     1 mpg ~ gear <lm>    202.
#> # … with 1,014 more rows

# Find top 10 models by AIC
top_10_models <-
  all_models %>% 
  slice_min(order_by = aic, n = 10)

top_10_models
#> # A tibble: 10 x 4
#>    nvars formula                            model    aic
#>    <int> <glue>                             <list> <dbl>
#>  1     3 mpg ~ wt + qsec + am               <lm>    154.
#>  2     4 mpg ~ hp + wt + qsec + am          <lm>    154.
#>  3     4 mpg ~ wt + qsec + am + carb        <lm>    155.
#>  4     5 mpg ~ disp + hp + wt + qsec + am   <lm>    155.
#>  5     3 mpg ~ cyl + hp + wt                <lm>    155.
#>  6     4 mpg ~ disp + wt + qsec + am        <lm>    155.
#>  7     3 mpg ~ cyl + wt + carb              <lm>    156.
#>  8     5 mpg ~ drat + wt + qsec + am + carb <lm>    156.
#>  9     5 mpg ~ hp + wt + qsec + am + carb   <lm>    156.
#> 10     4 mpg ~ cyl + wt + qsec + am         <lm>    156.

# Find top 2 models by number of X variables
top_2_models_by_nvars <-
  all_models %>% 
  group_by(nvars) %>% 
  slice_min(order_by = aic, n = 2) %>% 
  ungroup()

top_2_models_by_nvars
#> # A tibble: 20 x 4
#>    nvars formula                                                     model   aic
#>    <int> <glue>                                                      <lis> <dbl>
#>  1     0 mpg ~ 1                                                     <lm>   209.
#>  2     1 mpg ~ wt                                                    <lm>   166.
#>  3     1 mpg ~ cyl                                                   <lm>   169.
#>  4     2 mpg ~ cyl + wt                                              <lm>   156.
#>  5     2 mpg ~ hp + wt                                               <lm>   157.
#>  6     3 mpg ~ wt + qsec + am                                        <lm>   154.
#>  7     3 mpg ~ cyl + hp + wt                                         <lm>   155.
#>  8     4 mpg ~ hp + wt + qsec + am                                   <lm>   154.
#>  9     4 mpg ~ wt + qsec + am + carb                                 <lm>   155.
#> 10     5 mpg ~ disp + hp + wt + qsec + am                            <lm>   155.
#> 11     5 mpg ~ drat + wt + qsec + am + carb                          <lm>   156.
#> 12     6 mpg ~ disp + hp + drat + wt + qsec + am                     <lm>   156.
#> 13     6 mpg ~ disp + hp + wt + qsec + am + gear                     <lm>   156.
#> 14     7 mpg ~ disp + hp + drat + wt + qsec + am + gear              <lm>   158.
#> 15     7 mpg ~ cyl + disp + hp + drat + wt + qsec + am               <lm>   158.
#> 16     8 mpg ~ disp + hp + drat + wt + qsec + am + gear + carb       <lm>   160.
#> 17     8 mpg ~ disp + hp + drat + wt + qsec + vs + am + gear         <lm>   160.
#> 18     9 mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb  <lm>   162.
#> 19     9 mpg ~ cyl + disp + hp + drat + wt + qsec + am + gear + carb <lm>   162.
#> 20    10 mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear … <lm>   164.
3 Likes

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.