lm interpretation of output

Kate_Lee · May 30, 2019, 4:45am

This is probably more a statistical question than an R question, but I want to know how this lm() analysis comes out with a significant adjusted p-value (p=0.008) when the standard error on the change in IGF2 (-0.04 ng/ml) for every kg increase in weight is huge (0.45 ng/ml). The confidence interval of the effect size is therefore massive (-0.9 to 0.8).

I think I must be reading the output wrong.

Thanks in advance for any help
Kate

Call:
lm(formula = Cohort1$V1_IGF2_Result ~ Cohort1$SCREEN_Weight + 
    Cohort1$Sex + Cohort1$ETH)

Residuals:
    Min      1Q  Median      3Q     Max 
-451.62  -95.15    0.49   88.59  394.98 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           806.19536   81.65754   9.873   <2e-16 ***
Cohort1$SCREEN_Weight  -0.04404    0.45061  -0.098   0.9222    
Cohort1$SexM          -44.49234   24.63328  -1.806   0.0723 .  
Cohort1$ETHE           31.00856   74.11799   0.418   0.6761    
Cohort1$ETHM          -15.78481   80.27579  -0.197   0.8443    
Cohort1$ETHO          -29.85577   75.71341  -0.394   0.6937    
Cohort1$ETHP          -47.73104   77.43752  -0.616   0.5383    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 144.6 on 216 degrees of freedom
Multiple R-squared:  0.07641,	Adjusted R-squared:  0.05076 
F-statistic: 2.978 on 6 and 216 DF,  p-value: 0.008154

joels · May 30, 2019, 6:22am

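The p-value (0.008154) in the last line of the summary output is the p-value for the F-statistic (2.978), not for the weight coefficient. The p-value for weight is in its own row of the coefficient table (0.9222), which is consistent with the wide confidence interval you calculated. The F-statistic compares your regression model with a model that has just the intercept and no other variables: it is the ratio of the variance explained by the predictors (per degree of freedom used) to the residual variance. Its p-value is the probability of getting an F-statistic at least that large under the null hypothesis that your regression model is no better than a model with just the intercept.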

In this case, the result means that even though there are no variables in the model that are individually statistically significant, the model overall provides a statistically significantly better fit to the data than a model with just the intercept.
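
Both p-values can be reproduced from the numbers printed in the summary above, which may help show that they answer different questions. This is a minimal sketch that uses only the printed (rounded) values, so it agrees with the output only up to rounding; the variable names are made up for illustration:

```r
# Values copied from the printed summary (rounded)
est      <- -0.04404  # weight coefficient
se       <- 0.45061   # its standard error
df_resid <- 216       # residual degrees of freedom
r2       <- 0.07641   # multiple R-squared
k        <- 6         # number of coefficients excluding the intercept

# t test and 95% confidence interval for the weight coefficient alone
t_weight  <- est / se
p_weight  <- 2 * pt(-abs(t_weight), df_resid)
ci_weight <- est + c(-1, 1) * qt(0.975, df_resid) * se

# Overall F test: are all six non-intercept coefficients zero?
f_overall <- (r2 / k) / ((1 - r2) / df_resid)
p_overall <- pf(f_overall, k, df_resid, lower.tail = FALSE)

print(c(t = t_weight, p = p_weight))
print(ci_weight)
print(c(F = f_overall, p = p_overall))
```

The first p-value belongs to weight only; the second belongs to the model as a whole.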

You can also perform an F test to compare several models to see if adding one or more additional variables results in a statistically significantly improved fit. For example, in your case you could do*:

m1 = lm(V1_IGF2_Result ~ SCREEN_Weight + Sex + ETH, data=Cohort1)
m2 = lm(V1_IGF2_Result ~ SCREEN_Weight + Sex , data=Cohort1)
m3 = lm(V1_IGF2_Result ~ 1, data=Cohort1)

anova(m3, m2, m1, test="F")
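
Since Cohort1 is not available here, the same pattern can be tried on the built-in mtcars data. This is only a sketch: the choice of mpg, wt and hp is arbitrary and the object names are made up. Each row of the anova table tests whether the terms added relative to the previous model improve the fit.

```r
# Nested models on the built-in mtcars data (variables chosen arbitrarily)
fit_null <- lm(mpg ~ 1, data = mtcars)
fit_wt   <- lm(mpg ~ wt, data = mtcars)
fit_full <- lm(mpg ~ wt + hp, data = mtcars)

# Sequential F tests: null vs wt, then wt vs wt + hp
seq_tests <- anova(fit_null, fit_wt, fit_full, test = "F")
print(seq_tests)

# Comparing the null model directly with the full model reproduces
# the overall F-statistic reported by summary(fit_full)
overall <- anova(fit_null, fit_full, test = "F")
print(overall$F[2])
print(summary(fit_full)$fstatistic["value"])
```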

The F statistic can also be calculated by hand from the residual sums of squares of the two models. Below is an example in R using the built-in mtcars data. Note that the value calculated is the same as the value returned by summary(m1). The values 29 and 31 are the residual degrees of freedom (df) for models m1 and m2, respectively. df is the number of observations (32 here) minus the number of parameters (regression coefficients, including the intercept) estimated by the model: 3 for m1 and 1 for m2.

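# Full model and intercept-only model
m1 = lm(mpg ~ hp + wt, data=mtcars)
m2 = lm(mpg ~ 1, data=mtcars)

summary(m1)
summary(m2)

# Sum of squared residuals for each model
ssrM1 = sum(resid(m1)^2)
ssrM2 = sum(resid(m2)^2)

# Drop in the sum of squared residuals per extra parameter, divided by
# the residual variance of the full model
F_statistic = ((ssrM2 - ssrM1)/(31 - 29)) / (ssrM1/29)
F_statistic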
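
A variant of the same calculation reads the degrees of freedom from the fitted models with df.residual() instead of typing 29 and 31 by hand, and converts the result to a p-value with pf(). This is a sketch only, and the object names are illustrative:

```r
fit_full <- lm(mpg ~ hp + wt, data = mtcars)
fit_null <- lm(mpg ~ 1, data = mtcars)

# Residual sums of squares and residual degrees of freedom
rss_full <- sum(resid(fit_full)^2)
rss_null <- sum(resid(fit_null)^2)
df_full  <- df.residual(fit_full)  # 32 observations - 3 coefficients
df_null  <- df.residual(fit_null)  # 32 observations - 1 coefficient

# F statistic and its upper-tail p-value
f_manual <- ((rss_null - rss_full) / (df_null - df_full)) / (rss_full / df_full)
p_manual <- pf(f_manual, df_null - df_full, df_full, lower.tail = FALSE)

# Compare with the value stored in the model summary
f_summary <- summary(fit_full)$fstatistic
print(c(manual = f_manual, summary = unname(f_summary["value"])))
print(p_manual)
```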

* Note that the model formulas include only the variable names, while the data frame is entered in the data argument. The model should be specified this way, rather than by including the data frame name with each variable in the model.
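
One practical reason for this, as far as I know, is that predict() with new data only behaves as expected when the formula uses bare variable names. A small sketch on mtcars (object names are made up):

```r
# Formula with bare variable names plus a data argument
fit_good <- lm(mpg ~ wt, data = mtcars)
# Formula with the data frame name attached to each variable
fit_bad <- lm(mtcars$mpg ~ mtcars$wt)

new_car <- data.frame(wt = 3)

# One prediction for the one new row
pred_good <- predict(fit_good, newdata = new_car)

# The term mtcars$wt is still evaluated against the original mtcars data,
# not new_car, so predict() returns all 32 fitted values, with a warning
pred_bad <- suppressWarnings(predict(fit_bad, newdata = new_car))

print(length(pred_good))
print(length(pred_bad))
```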

Yarnabrina · May 30, 2019, 3:52pm

Along with what Joel has said, I'd like to add a comment. Have you considered multicollinearity?

In your example, none of the regression coefficients other than the intercept is significant at the 5% level (only Sex comes close, and it is significant only at the 10% level), but the model still rejects the null hypothesis that all regression coefficients other than the intercept are zero. This may occur as a result of multicollinearity. Wikipedia lists this as one of its symptoms:

"Insignificant regression coefficients for the affected variables in the multiple regression, but a rejection of the joint hypothesis that those coefficients are all zero (using an F-test)"

I think that this model may be affected by multicollinearity because as far as I understand, this model uses dummy variables for both Cohort1$Sex and Cohort1$ETH.

I'm not saying that it is the case, but it's quite likely (IMO). I've forgotten much about these things, so I'm not very sure. Maybe Joel or someone else can provide some more helpful input?
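
One rough way to check for this is with variance inflation factors (VIFs). Cohort1 is not available, so the sketch below uses mtcars with arbitrarily chosen numeric predictors and computes the VIFs in base R from the inverse of the predictors' correlation matrix. For factor predictors such as Sex and ETH a generalized VIF (for example from vif() in the car package) would be more appropriate; that is not shown here.

```r
# Example model on mtcars (predictors chosen arbitrarily)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Predictor columns of the design matrix, without the intercept
X <- model.matrix(fit)[, -1]

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on the other predictors; this equals the j-th diagonal element of the
# inverse correlation matrix of the predictors
vif_values <- diag(solve(cor(X)))
print(vif_values)

# Check the first value (wt) against the definition
r2_wt <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
print(1 / (1 - r2_wt))
```

A common rule of thumb treats VIFs well above 5 or 10 as a sign of troublesome collinearity, though the cutoff is a matter of judgment.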
