Split, apply, combine regression model assistance request

g3lo · May 30, 2018, 4:20pm

Example of Data Set:

key/y/x1/x2/x3
1/1/2/3
1/2/4/8
1/8/2/6
2/7/8/2
2/1/2/3
2/2/4/8
2/8/2/6
2/7/8/2

Essentially, I want to develop a code that would automatically run and produce seperate regressions based on a 'key'. For instance, in the above example we would have 2 regression equations (since there is 2 keys - 1 and 2). I need to run 500 separate equations based on unique identifiers, call them keys. Would be nice to also have an option to do the same and run a stepwise for additional insights. I want to accomplish this task in R-Studio.

Any help would be appreciated.

g3lo · May 31, 2018, 6:17pm

(post withdrawn by author, will be automatically deleted in 24 hours unless flagged)

Peter_Griffin · May 31, 2018, 6:44pm

I guess you may want to take a look at the plm package and treat your key as a fixed effect factor. If you really want to run separate regressions for each group, there are at least 2 ways you can do that. First, use for loop to create a list consisting of all the regression models. Second, work on the data frame with dplyr and use the group_by and mutate to create columns of your interests (coefficients, r squared, etc.)

g3lo · May 31, 2018, 8:29pm

Are you able to assist with a sample code based on the simple dataframe above?

jcblum · June 1, 2018, 1:35pm

This is a common analysis pattern that is often called “split, apply, combine” because you are splitting your data into groups based on a key value, applying a model to each group, and then combining the results back together for further inspection. As @Peter_Griffin mentioned, there are a few different ways to write code that does this. This link goes through one tidyverse approach in detail:
http://stat545.com/block024_group-nest-split-map.html

The overview to the tidyverse package purrr also has a simple example right up front:

The following example uses purrr to solve a fairly realistic problem: split a data frame into pieces, fit a model to each piece, compute the summary, then extract the R².

   library(purrr)

   mtcars %>%
     split(.$cyl) %>% # from base R
    map(~ lm(mpg ~ wt, data = .)) %>%
    map(summary) %>%
     map_dbl("r.squared")
   #>         4         6         8 
   #> 0.5086326 0.4645102 0.4229655

In all cases, the first step is to write modeling code that does what you want for one piece of your dataset. Can you post an example of the model code you have in mind? That would make it a lot easier for people to help you with splitting-applying-combining that model to your data.

P.S. You also might attract more interested helpers if you changed the title of your post to something more specific to your problem.

g3lo · June 1, 2018, 2:52pm

Thank you for your help. Based on the sample data in the OP, say I have this model.

fit <- lm(y~., data = select(df, key==1))

This will select key = 1 and apply a regression model on all x variables. How do I automate this process for all key's, say 500 of them and produce a summary of each model in the form of say a matrix.

For example a summary would look like this. It doesn't have to be exactly that as I need the p-values as well. But something that I can then use and calculate the answers to different x variables quickly:

#             y1         y2         y3       
#(Intercept)  -0.081155   0.042049   0.007261
#x1           -0.037556   0.181407  -0.070109
#x2           -0.334067   0.223742   0.015100
#x3            0.057861  -0.075975  -0.099762

Although my y is in the same column, the models will differ as I use a separate key to identify each model.

martin.R · June 1, 2018, 3:12pm

You can use the broom package for this.

g3lo · June 1, 2018, 3:16pm

I tried, but was unsuccessful replicating 500 loops into a nice summary of multiple models.

g3lo · June 1, 2018, 3:23pm

Here is what I tried:

fitlist <- list()
for(i in unique(df$key)){
        fitlist <- append(fitlist, lm(y ~ ., 
                                            data = df[df$key == i]))
    }
}

tidy(fitlist)

It seems it cannot tidy a list, and I am unsure where to store results from my loop

g3lo · June 1, 2018, 3:44pm

I also found this in another forum, this would ideal. Is this possible to do with incorporating my code above?

 mtable123 <- mtable("Model 1"=lm1,"Model 2"=lm2,"Model 3"=lm3,
     summary.stats=c("sigma","R-squared","F","p","N"))

> mtable123

Calls:
Model 1: lm(formula = weight ~ group)
Model 2: lm(formula = weight ~ group - 1)
Model 3: lm(formula = log(weight) ~ group - 1)

=============================================
                 Model 1   Model 2   Model 3 
---------------------------------------------
(Intercept)      5.032***                    
                (0.220)                      
group: Trt/Ctl  -0.371                       
                (0.311)                      
group: Ctl                 5.032***  1.610***
                          (0.220)   (0.045)  
group: Trt                 4.661***  1.527***
                          (0.220)   (0.045)  
---------------------------------------------
sigma             0.696      0.696     0.143 
R-squared         0.073      0.982     0.993 
F                 1.419    485.051  1200.388 
p                 0.249      0.000     0.000 
N                20         20        20     
=============================================

benjamin.gowan · June 14, 2018, 5:30pm

Hi there @g3lo, I think you are looking for something along the lines of this code, where you nest the data by key and fit a model to each dataset and then extract out the summaries using a combination of purrr and broom::glance

library(tidyverse)
library(broom)

df %>% 
group_by(key) %>% 
nest() %>% 
mutate(models = map(data, ~ lm(y~., data = .x))) %>% 
mutate(summaries = map(models, glance)) %>% 
unnest(summaries)

Hope that helps!

Cheers,
Ben