How to perform group-wise linear regression for a data frame in R

zeeshan0112 · January 31, 2023, 8:31am

I have updated dplyr and am now getting this error:

Error in UseMethod("nest_by") :
no applicable method for 'nest_by' applied to an object of class "function"

nirgrahamuk · January 31, 2023, 8:36am

mtcars |> 
  nest_by(cyl)

Can you run this?

zeeshan0112 · January 31, 2023, 8:37am

yes i am able to run this

mtcars |>
  nest_by(cyl)

# A tibble: 3 × 2
# Rowwise:  cyl
    cyl                data
  <dbl> <list<tibble[,10]>>
1     4           [11 × 10]
2     6            [7 × 10]
3     8           [14 × 10]

nirgrahamuk · January 31, 2023, 8:38am

Can you provide an example of the code you are trying that gives you an error?

zeeshan0112 · January 31, 2023, 8:40am

groupLM <- sample|>
nest_by(bank_year) |>
mutate(lm_model = list(lm(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14, d = sample)))

nirgrahamuk · January 31, 2023, 9:30am

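This is a good example of a general lesson: choose good names for your objects, and prefer names that don't clash with base R function names.
base::sample is a function, so if no data frame called sample has been created in the session, R finds the function instead, which is why nest_by was applied to an object of class "function". If you have a data.frame related to some sample, consider names like sample_df.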
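
A minimal sketch of the clash, using the built-in mtcars data as a stand-in (the name sample_df is only illustrative, and clash_msg is a hypothetical helper variable). With no data frame called sample defined, the name resolves to base::sample, and piping it into nest_by should reproduce the error above:

```r
library(dplyr)

# With no user-defined `sample`, the name refers to the base R function
is.function(sample)

# Piping that function into nest_by() fails; capture the message instead of stopping
clash_msg <- tryCatch(
  sample |> nest_by(cyl),
  error = function(e) conditionMessage(e)
)
clash_msg

# A distinct name avoids the clash (mtcars stands in for the real data)
sample_df <- mtcars
nested <- sample_df |> nest_by(cyl)
nested
```

Restarting the session or forgetting to re-run the read.csv line is a common way to end up in this state, because the function is always there to be found when the data frame is not.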

zeeshan0112 · January 31, 2023, 12:09pm

The code now runs, but the results are identical for every group. I believe the result of a single bank-year combination is copied into all the regressions.

library(dplyr)
library(tidyverse)
library(broom)
data_5 <- read.csv("data_sample.csv")

y <- data_5$nse_returns
x1 <- data_5$auto
x2 <- data_5$consumer_durables
x3 <- data_5$FMCG
x4 <- data_5$healthcare
x5 <- data_5$IT
x6 <- data_5$media
x7 <- data_5$metal
x8 <- data_5$oil_gas
x9 <- data_5$pharma
x10 <- data_5$reality
x11 <- data_5$finance
x12 <- data_5$Mkt.RF
x13 <- data_5$SMB
x14 <- data_5$HML
groupLM <- data_5 |>
  nest_by(bank_year) |>
  mutate(lm_model = list(lm(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14, d = data_5)))

groupLM

# A tibble: 196 × 3
# Rowwise:  bank_year
   bank_year                 data lm_model
   <chr>      <list<tibble[,18]>> <list>
 1 ALD2018             [246 × 18] <lm>
 2 ALD2019             [244 × 18] <lm>
 3 ALD2020              [55 × 18] <lm>
 4 ANDHRA2018          [246 × 18] <lm>
 5 ANDHRA2019          [244 × 18] <lm>
 6 ANDHRA2020           [55 × 18] <lm>
 7 AUSF2018            [246 × 18] <lm>
 8 AUSF2019            [244 × 18] <lm>
 9 AUSF2020            [250 × 18] <lm>
10 AUSF2021            [248 × 18] <lm>
# … with 186 more rows
# ℹ Use `print(n = ...)` to see more rows

groupLM |> reframe(glance(lm_model))

# A tibble: 196 × 13
   bank_year  r.squ…¹ adj.r…²  sigma stati…³ p.value    df logLik     AIC     BIC devia…⁴ df.re…⁵  nobs
   <chr>        <dbl>   <dbl>  <dbl>   <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <int> <int>
 1 ALD2018     0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 2 ALD2019     0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 3 ALD2020     0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 4 ANDHRA2018  0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 5 ANDHRA2019  0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 6 ANDHRA2020  0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 7 AUSF2018    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 8 AUSF2019    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 9 AUSF2020    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
10 AUSF2021    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
# … with 186 more rows, and abbreviated variable names ¹r.squared, ²adj.r.squared, ³statistic,
#   ⁴deviance, ⁵df.residual
# ℹ Use `print(n = ...)` to see more rows

nirgrahamuk · January 31, 2023, 12:22pm

Please format your post.

My advice is to think about the argument that lm() takes to establish which data it should use. If the nest operation produced an appropriate table and stored it in a list column called data, then it is that column that should be passed to lm(), and certainly not the entire unnested dataset (data_5).

I gave a similar recommendation when do was discussed.
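
As a sketch of that advice on the built-in mtcars data (a stand-in, since data_sample.csv is not available here; mpg and wt are mtcars columns, not the poster's, and by_cyl, n_rows, and fit are illustrative names): after nest_by, each row carries its own group's rows in the list column data, and inside mutate the name data refers to just that group's tibble.

```r
library(dplyr)

by_cyl <- mtcars |>
  nest_by(cyl) |>
  mutate(
    n_rows = nrow(data),                      # rows in this group only
    fit    = list(lm(mpg ~ wt, data = data))  # model fitted on this group's rows
  )

# The number of observations each model used should match its group size
by_cyl$n_rows
sapply(by_cyl$fit, nobs)
```

Comparing nobs against the group sizes is a quick way to confirm that each model really saw only its own group.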

zeeshan0112 · January 31, 2023, 12:48pm

> data_5 <- read.csv("data_sample.csv")
> y <- data_5$nse_returns
> x1 <- data_5$auto
> x2 <- data_5$consumer_durables
> x3 <- data_5$FMCG
> x4 <- data_5$healthcare
> x5 <- data_5$IT
> x6 <- data_5$media
> x7 <- data_5$metal
> x8 <- data_5$oil_gas
> x9 <- data_5$pharma
> x10 <- data_5$reality
> x11 <- data_5$finance
> x12 <- data_5$Mkt.RF
> x13 <- data_5$SMB
> x14 <- data_5$HML
> groupLM <- data_5 |> 
+   nest_by(bank_year) |> 
+   mutate(lm_model = list(lm(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14, d = data_5)))

> groupLM
# A tibble: 196 × 3
# Rowwise:  bank_year
   bank_year                 data lm_model
   <chr>      <list<tibble[,18]>> <list>  
 1 ALD2018             [246 × 18] <lm>    
 2 ALD2019             [244 × 18] <lm>    
 3 ALD2020              [55 × 18] <lm>    
 4 ANDHRA2018          [246 × 18] <lm>    
 5 ANDHRA2019          [244 × 18] <lm>    
 6 ANDHRA2020           [55 × 18] <lm>    
 7 AUSF2018            [246 × 18] <lm>    
 8 AUSF2019            [244 × 18] <lm>    
 9 AUSF2020            [250 × 18] <lm>    
10 AUSF2021            [248 × 18] <lm>    
# … with 186 more rows
# ℹ Use `print(n = ...)` to see more rows

> groupLM |> reframe(glance(lm_model))

# A tibble: 196 × 13
   bank_year  r.squ…¹ adj.r…²  sigma stati…³ p.value    df logLik     AIC     BIC devia…⁴ df.re…⁵  nobs
   <chr>        <dbl>   <dbl>  <dbl>   <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <int> <int>
 1 ALD2018     0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 2 ALD2019     0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 3 ALD2020     0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 4 ANDHRA2018  0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 5 ANDHRA2019  0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 6 ANDHRA2020  0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 7 AUSF2018    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 8 AUSF2019    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 9 AUSF2020    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
10 AUSF2021    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
# … with 186 more rows, and abbreviated variable names ¹r.squared, ²adj.r.squared, ³statistic,
#   ⁴deviance, ⁵df.residual
# ℹ Use `print(n = ...)` to see more rows

> groupLM |> reframe(tidy(lm_model))

# A tibble: 2,940 × 6
   bank_year term        estimate std.error statistic  p.value
   <chr>     <chr>          <dbl>     <dbl>     <dbl>    <dbl>
 1 ALD2018   (Intercept) 1.00     0.000134   7479.    0       
 2 ALD2018   x1          0.000887 0.000161      5.51  3.52e- 8
 3 ALD2018   x2          0.000728 0.000168      4.33  1.51e- 5
 4 ALD2018   x3          0.000531 0.000208      2.56  1.06e- 2
 5 ALD2018   x4          0.00116  0.000634      1.82  6.86e- 2
 6 ALD2018   x5          0.000116 0.000181      0.639 5.23e- 1
 7 ALD2018   x6          0.00144  0.0000893    16.2   1.05e-58
 8 ALD2018   x7          0.000675 0.000107      6.33  2.40e-10
 9 ALD2018   x8          0.00289  0.000185     15.7   3.81e-55
10 ALD2018   x9          0.000389 0.000556      0.699 4.85e- 1
# … with 2,930 more rows
# ℹ Use `print(n = ...)` to see more rows

nirgrahamuk · January 31, 2023, 1:46pm

This is just my opinion, but without extra context from you to explain or motivate it, creating the separate y and x1-x14 objects seems both self-defeating and pointless extra work.

Practically, the negative impact is that y and x1-x14 do not exist as columns in the data_5 that you nest. When they appear in your lm formula, lm() does not find them in the data it is given, so it falls back to the objects of those names in your workspace, which hold the full, ungrouped columns. The fit is therefore no longer driven by the nesting at all. You have also continued to pass data_5 as the d= argument, when I have said twice before that this does not work and that it should be the product of the nest.

Question 1) Do you have a requirement to hide the actual variable names and substitute non-descriptive names such as x1-x14?
If you do, we can talk about good approaches, but I would guess that you don't.
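
The lookup behaviour described above can be sketched on mtcars (a stand-in for the real data; masked and grouped are illustrative names). When the formula names objects that are not columns of the nested data, lm() picks up the free-standing vectors and every group gets the same full-data fit:

```r
library(dplyr)

# Free-standing copies of full columns, as in the post above
y  <- mtcars$mpg
x1 <- mtcars$wt

# y and x1 are not columns of the nested data, so lm() finds the
# full-length vectors above and every group gets the same fit
masked <- mtcars |>
  nest_by(cyl) |>
  mutate(fit = list(lm(y ~ x1, data = data)))

# Using the real column names keeps the lookup inside each group's data
grouped <- mtcars |>
  nest_by(cyl) |>
  mutate(fit = list(lm(mpg ~ wt, data = data)))

sapply(masked$fit, nobs)   # every model uses all 32 rows of mtcars
sapply(grouped$fit, nobs)  # each model uses only its own group's rows
```

This matches the symptom in the output above, where every bank_year reports the same nobs of 43358 even though the groups hold a few hundred rows each.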

zeeshan0112 · January 31, 2023, 2:07pm

There is no requirement to hide the actual variable names. I didn't use the actual names because I didn't want to make the model messy. I am not an expert in R and don't know many of the conventions. I am really sorry if my silly mistakes are displeasing you. If I use the actual names, will it work?

The actual names: y is nse_returns and the x variables are the columns auto to HML.

date	name	year	bank_year	nse_returns	auto	consumer_durables	FMCG	healthcare	IT	media	metal	oil_gas	pharma	reality	finance	Mkt.RF	SMB	HML	RF
01-01-2008	ALD	2008	ALD2008	1.09	0.871	-0.528	2.199	-0.097	-1.308	1.599	0.195	-0.26	-0.255	2.907	0.234	0.02	0.01	-0.01	0.01
01-02-2008	ALD	2008	ALD2008	1.02	1.611	-2.091	0.07	-0.85	2.845	-5.311	-0.997	-0.23	-1.14	-4.486	-2.192	1.26	-0.05	-0.14	0.01
01-04-2008	ALD	2008	ALD2008	1	-0.812	0.94	1.871	-0.649	-1.065	-0.586	-1.967	2.471	-0.979	-0.638	-1.127	1.95	-1.41	0.19	0.01
01-07-2008	ALD	2008	ALD2008	0.96	-1.906	-0.632	-1.026	-0.429	0.44	-2.274	-0.653	-0.352	-0.604	-1.537	-1.366	-0.66	0.09	-0.16	0.01
01-08-2008	ALD	2008	ALD2008	1.01	-1.92	-0.546	-1.826	-0.348	1.442	-0.589	0.354	0.844	0.242	-1.053	1.491	-1.09	0.51	0.36	0.01

nirgrahamuk · January 31, 2023, 2:15pm

I've attempted to go through and apply EconProf's approach to what we understand of your data and model needs. I've tried to be more explicit than is strictly needed by renaming the result of nest_by (via .key) and using that name within lm().


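library(dplyr)  # nest_by(), mutate(), reframe()
library(broom)  # glance(), tidy()

data_5 <- read.csv("data_sample.csv")

# One row per bank_year; each group's rows go into the list column nested_data
groupLM <- data_5 |>
  nest_by(bank_year,
          .key = "nested_data") |>
  # Fit one model per row, using only that group's nested data
  mutate(lm_model = list(lm(nse_returns ~ auto +
                            consumer_durables +
                            FMCG +
                            healthcare +
                            IT +
                            media +
                            metal +
                            oil_gas +
                            pharma +
                            reality +
                            finance +
                            Mkt.RF +
                            SMB +
                            HML, data = nested_data)))

# Model-level summaries, one row per bank_year
groupLM |> reframe(glance(lm_model))

# Coefficient estimates, one row per term per bank_year
groupLM |> reframe(tidy(lm_model))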
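
Because data_sample.csv is not included in the thread, the same pattern can be tried on mtcars as a rough stand-in. This is only a sketch: the formula mpg ~ wt + hp and the names car_models, fit_stats, and coefs are illustrative, and reframe() assumes dplyr 1.1.0 or later.

```r
library(dplyr)
library(broom)

# Same structure as above, with mtcars columns in place of the bank data
car_models <- mtcars |>
  nest_by(cyl, .key = "nested_data") |>
  mutate(lm_model = list(lm(mpg ~ wt + hp, data = nested_data)))

# One row of fit statistics per group
fit_stats <- car_models |> reframe(glance(lm_model))

# One row per coefficient per group
coefs <- car_models |> reframe(tidy(lm_model))

fit_stats[, c("cyl", "r.squared", "nobs")]
coefs
```

If the grouping is working, r.squared and nobs should differ from row to row rather than repeating as they did in the earlier output.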

zeeshan0112 · January 31, 2023, 2:29pm

Thank you very very much this worked.

deafcrump · February 6, 2023, 1:20pm

I have posted the sample data. Can you please help me out?

nirgrahamuk · February 6, 2023, 1:25pm

Did you post in the wrong thread?

system · February 13, 2023, 1:25pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.
