GAM (Generalized Additive Model) Regression

Hi,

I am trying to run a regression analysis on my GAM model in R and am using the blog DataTechNotes as a guide (Link: DataTechNotes: GAM (Generalized Additive Model) Regression Example with R).

In the example from the blog, they created their GAM model using the following code:

bgam=gam(medv~s(rm)+s(lstat)+ptratio+indus+crim+zn+age, data=boston)

Similarly, I used the following code to create my GAM model

preGAM=gam(preOutlierRem.dat$DailyIncrease ~ s(preOutlierRem.dat$Precipitation))

When they view the summary of their GAM model using the code

summary(bgam)

They receive the following output, which shows the estimates of the parametric coefficients:

Family: gaussian 
Link function: identity 

Formula:
medv ~ s(rm) + s(lstat) + ptratio + indus + crim + zn + age

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 31.54656    2.08187  15.153  < 2e-16 ***
ptratio     -0.52355    0.10112  -5.178 3.29e-07 ***
indus        0.00383    0.04052   0.095   0.9247    
crim        -0.13005    0.02498  -5.207 2.84e-07 ***
zn          -0.01682    0.01065  -1.579   0.1150    
age          0.01848    0.01055   1.751   0.0806 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
           edf Ref.df     F p-value    
s(rm)    6.514  7.680 24.30  <2e-16 ***
s(lstat) 6.272  7.451 34.29  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.807   Deviance explained = 81.4%
GCV = 16.948  Scale est. = 16.319    n = 506

However, when I view the summary of my GAM model using the code:

summary(preGAM)

I receive a similar output (shown below); however, my output only shows the intercept and not the additional parametric coefficients shown in the example.

Family: gaussian 
Link function: identity 

Formula:
postOutlierRem.dat$DailyIncrease ~ s(postOutlierRem.dat$Precipitation)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.102905   0.009653   10.66   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                                      edf Ref.df     F p-value
s(postOutlierRem.dat$Precipitation) 8.934  8.999 413.6  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.872   Deviance explained = 87.4%
GCV = 0.052103  Scale est. = 0.05116   n = 549

How do I find the estimate for the coefficient "Precipitation" in my model? I want to be able to run two different GAM models and see how the estimate of the coefficient "Precipitation" varies between my two models. I appreciate any feedback that anyone has. TIA.

summary() is a generic function that extracts elements from an object according to dispatch logic based on the object's class, usually an S3 object consisting of a list. Using

str(object)

will give you visibility into the list, and then it's up to the user to extract the desired elements and place them in a new object for display. It's a pita, but that's how it is. {broom} may, or may not, have a tidier that does what you need.

Without a complete reprex, including representative data, I'll use a different illustration:

fit <- lm(mpg ~ drat, mtcars)
summary(fit)
#> 
#> Call:
#> lm(formula = mpg ~ drat, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -9.0775 -2.6803 -0.2095  2.2976  9.0225 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   -7.525      5.477  -1.374     0.18    
#> drat           7.678      1.507   5.096 1.78e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.485 on 30 degrees of freedom
#> Multiple R-squared:  0.464,  Adjusted R-squared:  0.4461 
#> F-statistic: 25.97 on 1 and 30 DF,  p-value: 1.776e-05
str(fit)
#> List of 12
#>  $ coefficients : Named num [1:2] -7.52 7.68
#>   ..- attr(*, "names")= chr [1:2] "(Intercept)" "drat"
#>  $ residuals    : Named num [1:32] -1.42 -1.42 0.763 5.276 2.038 ...
#>   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>  $ effects      : Named num [1:32] -113.65 -22.86 1.05 5.28 2.07 ...
#>   ..- attr(*, "names")= chr [1:32] "(Intercept)" "drat" "" "" ...
#>  $ rank         : int 2
#>  $ fitted.values: Named num [1:32] 22.4 22.4 22 16.1 16.7 ...
#>   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>  $ assign       : int [1:2] 0 1
#>  $ qr           :List of 5
#>   ..$ qr   : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>   .. .. ..$ : chr [1:2] "(Intercept)" "drat"
#>   .. ..- attr(*, "assign")= int [1:2] 0 1
#>   ..$ qraux: num [1:2] 1.18 1.09
#>   ..$ pivot: int [1:2] 1 2
#>   ..$ tol  : num 1e-07
#>   ..$ rank : int 2
#>   ..- attr(*, "class")= chr "qr"
#>  $ df.residual  : int 30
#>  $ xlevels      : Named list()
#>  $ call         : language lm(formula = mpg ~ drat, data = mtcars)
#>  $ terms        :Classes 'terms', 'formula'  language mpg ~ drat
#>   .. ..- attr(*, "variables")= language list(mpg, drat)
#>   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#>   .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. ..$ : chr [1:2] "mpg" "drat"
#>   .. .. .. ..$ : chr "drat"
#>   .. ..- attr(*, "term.labels")= chr "drat"
#>   .. ..- attr(*, "order")= int 1
#>   .. ..- attr(*, "intercept")= int 1
#>   .. ..- attr(*, "response")= int 1
#>   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   .. ..- attr(*, "predvars")= language list(mpg, drat)
#>   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
#>   .. .. ..- attr(*, "names")= chr [1:2] "mpg" "drat"
#>  $ model        :'data.frame':   32 obs. of  2 variables:
#>   ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>   ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>   ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ drat
#>   .. .. ..- attr(*, "variables")= language list(mpg, drat)
#>   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#>   .. .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. .. ..$ : chr [1:2] "mpg" "drat"
#>   .. .. .. .. ..$ : chr "drat"
#>   .. .. ..- attr(*, "term.labels")= chr "drat"
#>   .. .. ..- attr(*, "order")= int 1
#>   .. .. ..- attr(*, "intercept")= int 1
#>   .. .. ..- attr(*, "response")= int 1
#>   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   .. .. ..- attr(*, "predvars")= language list(mpg, drat)
#>   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
#>   .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "drat"
#>  - attr(*, "class")= chr "lm"
fit$effects
#>   (Intercept)          drat                                           
#> -113.64973741  -22.85783135    1.05437313    5.27928002    2.06792484 
#>                                                                       
#>    4.31690365   -2.77037959    3.82318495    0.54301796   -3.05698204 
#>                                                                       
#>   -4.45698204    0.35233075    1.25233075   -0.84766925   -4.62495890 
#>                                                                       
#>   -5.13631407   -2.51648106    8.97420614    0.76489335    9.45149580 
#>                                                                       
#>    0.85013422    1.71690365   -1.43207516   -7.56901800    3.07928002 
#>                                                                       
#>    3.87420614    0.01743029    9.23877904   -8.64850420   -0.36545987 
#>                             
#>   -4.48105397   -2.24494607
summary(fit)
#> 
#> Call:
#> lm(formula = mpg ~ drat, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -9.0775 -2.6803 -0.2095  2.2976  9.0225 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   -7.525      5.477  -1.374     0.18    
#> drat           7.678      1.507   5.096 1.78e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.485 on 30 degrees of freedom
#> Multiple R-squared:  0.464,  Adjusted R-squared:  0.4461 
#> F-statistic: 25.97 on 1 and 30 DF,  p-value: 1.776e-05

Created on 2024-01-04 with reprex v2.0.2

Although fit$effects doesn't show up in summary(), it can still be extracted.
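Similarly, pieces that do appear in the printed output can be pulled out of the summary object itself, which is also a list. A quick sketch with the same mtcars fit (values match the summary shown above):

```r
fit <- lm(mpg ~ drat, data = mtcars)
s <- summary(fit)

# the coefficient table is a plain matrix inside the summary list
s$coefficients

# individual entries can be indexed by name
s$coefficients["drat", "Estimate"]  # the slope estimate, ~7.678
s$r.squared                         # ~0.464
```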

I have included csv files of some of my data, I have also provided my code below:

# import the data 

preOutlierRem.dat <- read.csv("PrecipDischargeOutliersRemPre.csv")
postOutlierRem.dat <- read.csv("PrecipDischargeOutliersRemPost.csv")

preOutlierRem.dat # view the data 
postOutlierRem.dat # view the data 

# create a GAM model for each dataset 
preGAM = gam(preOutlierRem.dat$DailyIncrease ~ s(preOutlierRem.dat$Precipitation))
postGAM = gam(postOutlierRem.dat$DailyIncrease ~ s(postOutlierRem.dat$Precipitation))

# view the summary of each GAM model
summary(preGAM) # r-squared = 0.908
summary(postGAM) # r-squared = 0.872

# plot each GAM model 
windowsFonts(A = windowsFont("Times New Roman"))
plot(preGAM, residuals = TRUE, xlab = "Daily Precipitation Depth (mm)", ylab = "Increase in Daily Peak Discharge (m3/s)", family = "A", ylim = c(0,15), xlim = c(0,60), pch = 19, cex = .5)
plot(postGAM, residuals = TRUE, xlab = "Daily Precipitation Depth (mm)", ylab = "Increase in Daily Peak Discharge (m3/s)", family = "A", ylim = c(0,15), xlim = c(0,60), pch = 19, cex = .5)

Does anyone have suggestions on how to see the parametric coefficient for "Precipitation" in my preGAM and postGAM models, as was shown in the blog?

Try

library(gam)
library(gt)
postOutlierRem.dat <- read.csv("precip.csv")
postGAM <- gam(postOutlierRem.dat$DailyIncrease ~ s(postOutlierRem.dat$Precipitation))
summary(postGAM) 
str(postGAM)
postGAM$coefficients
broom::tidy(postGAM) #|> 
  #gt()

Thank you so much, that worked


Ok, I have done something stupid now, but I am not sure what I have done. As shown in my previous code, when I ran the code

summary(preGAM)

I received the following output

Family: gaussian 
Link function: identity 

Formula:
postOutlierRem.dat$DailyIncrease ~ s(postOutlierRem.dat$Precipitation)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.102905   0.009653   10.66   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                                      edf Ref.df     F p-value
s(postOutlierRem.dat$Precipitation) 8.934  8.999 413.6  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.872   Deviance explained = 87.4%
GCV = 0.052103  Scale est. = 0.05116   n = 549

As shown in the above output, the summary of the GAM model included the adjusted r-squared value (0.872) and deviance explained, 87.4%.

This morning, I tried rerunning the code from yesterday (code shown below)

preGAM = gam(preOutlierRem.dat$DailyIncrease ~ s(preOutlierRem.dat$Precipitation))
summary(preGAM)

When I run these two lines of code, I now get the following output:

summary(preGAM)

Call: gam(formula = preOutlierRem.dat$DailyIncrease ~ s(preOutlierRem.dat$Precipitation))
Deviance Residuals:
       Min         1Q     Median         3Q        Max 
-0.7948248 -0.0003444 -0.0003444 -0.0003444  1.6634846 

(Dispersion Parameter for gaussian family taken to be 0.0171)

    Null Deviance: 80.9564 on 545 degrees of freedom
Residual Deviance: 9.2586 on 541 degrees of freedom
AIC: -664.5967 

Number of Local Scoring Iterations: NA 

Anova for Parametric Effects
                                    Df Sum Sq Mean Sq F value    Pr(>F)    
s(preOutlierRem.dat$Precipitation)   1 57.804  57.804  3377.6 < 2.2e-16 ***
Residuals                          541  9.259   0.017                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Anova for Nonparametric Effects
                                   Npar Df Npar F     Pr(F)    
(Intercept)                                                    
s(preOutlierRem.dat$Precipitation)       3 270.61 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As shown above, the summary for my preGAM model no longer shows the adjusted R-squared value or the deviance explained that were shown yesterday. Does anyone have any advice on how to find these adjusted R-squared values?

TIA

Do you mean

summary(postGAM)


Thank you. I have now created two GAM models (preGAM and postGAM). My preGAM model has an adjusted r-squared value of 0.905, and my postGAM model has an adjusted r-squared value of 0.866. I am fairly satisfied with these r-squared values; my professor indicated that anything over 0.8 is a great r-squared value for my model.

I plotted each GAM model (preGAM and postGAM) to see the trend of the residuals around the models and included a plot of the preGAM and postGAM models below.


Figure 1: preGAM Model Comparing Daily Precipitation Depth (mm) and Increase in Daily Peak Stream Discharge (m3/s)

Figure 2: postGAM Model Comparing Daily Precipitation Depth (mm) and Increase in Daily Peak Stream Discharge (m3/s)

I want to run some sort of regression to determine whether the slope of the relationship between Daily Precipitation Depth (mm) and Increase in Daily Peak Stream Discharge (m3/s), at different points along the curve, changes between the preGAM and postGAM models. Does anyone have any advice on how to achieve this?

Based on a visual assessment of the plots, it appears as if the overall slope of the model is greater in the postGAM scenario than it is in the preGAM scenario, but I want to be able to quantify this change through some sort of regression analysis. Does anyone have any experience with this or advice on how to assess changes in the overall trend of the two models?

TIA

What is your visual assessment if you plot the second on the same x-axis scale as the first?

Yes, I thought that it looked as if the slope of the model is greater in the postGAM scenario than it is in the preGAM scenario; however, I was wondering if there is a way to mathematically/statistically quantify this change. Here are the plots with the x-axes at the same scale.


Figure 1: preGAM Model Comparing Daily Precipitation Depth (mm) and Increase in Daily Peak Stream Discharge (m3/s)


Figure 2: postGAM Model Comparing Daily Precipitation Depth (mm) and Increase in Daily Peak Stream Discharge (m3/s)

I'd be queasy drawing an inference from the points above 25 mm, given how few there are on the lower plot (4), although they are well within the confidence interval, versus only 2 points inside on the upper; but that's just a gut reaction.

For the range 0-10 mm, if that's of interest, I'd need to think about it.


Yes, you are probably right. Do you know if there is a simple way of removing these observations from my dataset using code, rather than having to go through the .csv file and delete the observations with precipitation depth greater than 25 mm?
Thanks

We crossed.

Subsetting based on a logical condition is simple

new <- old[which(old$measure < 25), ]
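For instance, the same pattern on the built-in mtcars data (standing in for your csv, with hp < 100 as a hypothetical cutoff):

```r
# keep only the rows where the value is below the cutoff
small <- mtcars[which(mtcars$hp < 100), ]

# base R's subset() is an equivalent, arguably more readable, form
small2 <- subset(mtcars, hp < 100)

nrow(small)    # same row count either way
nrow(small2)
```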

Ok, I have now recreated the GAM models for precipitation events less than 15 mm, using the code:

prePrecip15 <- preOutlierRem.dat$Precipitation[which(preOutlierRem.dat$Precipitation < 15)]
preIncrease15 <- preOutlierRem.dat$DailyIncrease[which(preOutlierRem.dat$Precipitation < 15)]

postPrecip15 <- postOutlierRem.dat$Precipitation[which(postOutlierRem.dat$Precipitation < 15)]
postIncrease15 <- postOutlierRem.dat$DailyIncrease[which(postOutlierRem.dat$Precipitation < 15)]

In the above code, prePrecip15 is all precipitation events before treatment that are less than 15 mm deep, preIncrease15 is the corresponding increase in stream discharge during these precipitation events before treatment, postPrecip15 is all precipitation events after treatment that are less than 15 mm deep, and postIncrease15 is the corresponding increase in stream discharge after treatment.

I then created two GAM models (one for before treatment, and the other for after treatment), looking at the relationship between the precipitation depth and the increase in stream discharge using the code:

preGAM15 <- gam(preIncrease15 ~ s(prePrecip15), method = "REML")
postGAM15 <- gam(postIncrease15 ~ s(postPrecip15), method = "REML")

I viewed the r-squared values of each model using the summary() function and found that preGAM15 has an r-squared value of 0.623 and postGAM15 has an r-squared value of 0.617, which isn't terrible (I think it is the best I can get).

I then plotted each model (Figures 1 and 2, shown below)


Figure 1: preGAM15 Model Comparing Daily Precipitation Depth (mm) and Increase in Daily Peak Stream Discharge (m3/s)


Figure 2: postGAM15 Model Comparing Daily Precipitation Depth (mm) and Increase in Daily Peak Stream Discharge (m3/s)

I am now trying to determine if there is a difference in the slope of the regression curve between the two GAM models (i.e., preGAM15 and postGAM15). Does anyone have any suggestions as to how to determine/quantify the regression slope of these models to determine the underlying relationships between the daily precipitation depth (mm) and the increase in daily peak stream discharge (m3/s)?

TIA

I can see that the slope coefficients are different, but am not sure what this means for the overall model

preGAM15$coefficients
     (Intercept) s(prePrecip15).1 s(prePrecip15).2 s(prePrecip15).3 s(prePrecip15).4 s(prePrecip15).5 s(prePrecip15).6 
      0.01890256       0.08645833      -0.24569476       0.13332839       0.19350182       0.13919646      -0.18811312 
s(prePrecip15).7 s(prePrecip15).8 s(prePrecip15).9 
     -0.15779142      -0.83407930       0.18684002 
postGAM15$coefficients
      (Intercept) s(postPrecip15).1 s(postPrecip15).2 s(postPrecip15).3 s(postPrecip15).4 s(postPrecip15).5 
       0.03787266        0.10965270       -0.25323881       -0.12089909        0.09485919       -0.10715076 
s(postPrecip15).6 s(postPrecip15).7 s(postPrecip15).8 s(postPrecip15).9 
      -0.16011382       -0.13397841       -0.64934405        0.20359007

You can test the fits with anova().
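A minimal sketch of that comparison, on simulated data rather than yours; note the models should be nested, and when their smooth structure differs it is usual to fit them with method = "ML" (rather than REML) before comparing:

```r
library(mgcv)  # ships with R as a recommended package

set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
# y depends smoothly on x1 only; x2 is pure noise
y  <- sin(2 * pi * x1) + rnorm(n, sd = 0.3)
d  <- data.frame(y = y, x1 = x1, x2 = x2)

m1 <- gam(y ~ s(x1),         data = d, method = "ML")
m2 <- gam(y ~ s(x1) + s(x2), data = d, method = "ML")

# approximate F-test: does the extra smooth improve the fit?
anova(m1, m2, test = "F")
```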

Update

Ignore the rest of this post down to the bottom, where there is a link that might be useful. Read @ucfagls's answers instead of mine.

For a test on fitted models, be sure to understand the null hypothesis, which differs from the case when ANOVA is used to test means.

  1. Hypothesis Statement:
  • H0 (Null Hypothesis): The simpler model (Model 1) is sufficient, and the additional predictors in the more complex model (Model 2) do not provide a significantly better fit to the data. In other words, the null hypothesis posits that the additional predictors' coefficients are zero.
  • H1 (Alternative Hypothesis): The more complex model (Model 2) provides a significantly better fit to the data than the simpler model (Model 1), meaning that at least one of the additional predictors' coefficients is non-zero.
  2. Example:
  • Suppose Model 1 predicts a dependent variable Y based on an independent variable X1, and Model 2 predicts Y based on X1 and another variable X2. The null hypothesis in comparing these models via ANOVA would be that X2 does not significantly improve the prediction of Y when considering X1.

In summary, the null hypothesis for an ANOVA comparing two fitted models is that the simpler model is adequate and that the additional complexity of the more complex model does not significantly improve the model's fit to the data. This is tested by comparing the residual sum of squares of the two models.

The null hypothesis in a statistical test is rejected when the p-value of the test is less than or equal to a pre-determined significance level, denoted α. The significance level α is the threshold for deciding whether a result is statistically significant. It represents the probability of rejecting the null hypothesis when it is actually true (Type I error).

Here's how you determine whether to reject the null hypothesis for a given α:

  1. Choose the Significance Level (α): Before conducting the test, you choose a significance level. Common choices for α are 0.05, 0.01, and 0.10. A lower α means stricter criteria for rejecting the null hypothesis.

  2. Conduct the Statistical Test: Perform the ANOVA on the data. This test will yield a p-value.

  3. Compare the p-value with α:

    • If the p-value is less than or equal to α (p ≤ α), you reject the null hypothesis. This indicates that the observed effect is statistically significant, and the likelihood of observing such an effect by chance is low.
    • If the p-value is greater than α (p > α), you fail to reject the null hypothesis. This means that there isn't enough evidence in your data to conclude that the effect is statistically significant.

For example, if your significance level α is 0.05 and the p-value of your test is 0.03, you would reject the null hypothesis because 0.03 is less than 0.05. However, if the p-value were 0.07, you would not reject the null hypothesis since 0.07 is greater than 0.05.

It's important to note that failing to reject the null hypothesis does not prove that the null hypothesis is true. It simply indicates that there is not enough evidence against the null hypothesis given the data and the chosen level of significance.

You might also wish to consider a dynamic GAM approach. See Clark, N. J., & Wells, K. (2023). Dynamic generalised additive models (DGAMs) for forecasting discrete ecological time series. Methods in Ecology and Evolution, 14(3), 771-784.

I would strongly advise not to use the gam package over the mgcv package; the latter provides automatic smoothness selection whereas in the former you need to specify the degrees of freedom for the smooths. The blog post used the mgcv package, but you are misunderstanding the output there.

Note the formula that was used in that blog post:

medv ~ s(rm) + s(lstat) + ptratio + indus + crim + zn + age

Only two of the terms in this model are smooths: s(rm) and s(lstat). The other terms in the formula are parametric effects. That is why you see more entries in the Parametric coefficients table in the summary() output. There is nothing about rm and lstat in the Parametric coefficients table.

In your model, which should be specified as

DailyIncrease ~ s(Precipitation)

there is only a single model term and it is a smooth. As you have no parametric terms beyond the model constant term (the intercept), there will only be a single line in the Parametric coefficients table. The focus of inference should be on the smooth function itself, and that is what is shown in the Approximate significance of smooth terms table. Note that you shouldn't use df$var notation in model formulas; pass the data frame to the data argument of your model, hence

gam(DailyIncrease ~ s(Precipitation), data = preOutlierRem.dat, method = "REML")

or

gam(DailyIncrease ~ s(Precipitation), data = preOutlierRem.dat, method = "ML")

would be advised.

The smooth in your model is represented as a penalised thin plate spline; a spline is a function that is formed by a weighted sum of basis functions. Your spline was built using 9 thin plate spline basis functions. Each of those functions has a coefficient associated with it: run coef() on your model and you'll see the nine coefficients associated with the smooth, plus the intercept.
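As a quick illustration on a built-in dataset (mtcars, standing in for the poster's data):

```r
library(mgcv)

# a single-smooth model, analogous in form to the poster's
m <- gam(mpg ~ s(drat), data = mtcars, method = "REML")

# one coefficient per basis function plus the intercept:
# with the default k = 10 thin plate basis that is 9 + 1 = 10
coef(m)
length(coef(m))  # 10
```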

As the whole point of fitting a GAM is to estimate smooth functions, we should consider the estimates of those functions. There is little to no point in looking at the individual coefficients. This is why there is a single line in the model summary for the smooth; this is a Wald-like test against a null hypothesis that the true function is a flat, constant function with value 0 everywhere: \text{H}_0: f(x_i) = 0 \; \forall \; i.

Apart from that, we should plot the estimated smooth; you can do this with plot(model) using mgcv or use another package for plotting smooths fitted by mgcv, such as my gratia package or the itsadug package. With gratia you would just do draw(model) to get a nice ggplot-based version of the figure plot() would produce.

How do I find the estimate for the coefficient "Precipitation" in my model?

In summary, you don't; these models are not parameterised in the way you are expecting. There isn't a single coefficient for the smooth term. There are many coefficients associated with a smooth and it doesn't make sense in most cases to compare coefficients.

I want to be able to run two different GAM models and see how the estimate of the coefficient "Precipitation" varies between my two models.

This implies that you have the response variable DailyIncrease measured at two different conditions or settings. If so, you should fit one model, where we allow the smooth effect of the covariate on the response to differ between the two settings. The standard way to do this is shown below, in which I am assuming that you have joined your two data frames row-wise (into combined_dat) and that there is a variable prepost that is a factor coding for whether we are using the pre or the post data set.

gam(DailyIncrease ~ prepost  + s(Precipitation, by = prepost),
  data = combined_dat, method = "REML")

This model fits two separate smooths, one where prepost == "pre" and one where prepost == "post"; you can then plot the two functions with gratia::draw() or mgcv::plot.gam() for a visual comparison.

With this model there isn't a test for the difference. You can use gratia::difference_smooths() to compute a difference between the two estimated smooths and an associated uncertainty band. This can then be plotted to show where over the covariate the two functions differ.

An alternative model formulation is via an ordered factor, or via the sz basis. I'll show examples of these below.

Using your data I worked up the following example

Your original model

First load the data and packages we'll use. If you don't have gratia installed, follow the instructions here to install the development version

# packages
library("readr")
library("gratia")
library("mgcv")
library("dplyr")

# load data
pre  <- read_csv("PrecipDischargeOutliersRemPre.csv", col_types = "Ddddd-")
post <- read_csv("PrecipDischargeOutliersRemPost.csv", col_types = "Ddddd-")

Fit your original model, properly specified:

# original model
m_orig <- gam(DailyIncrease ~ s(Precipitation),
    data = pre, method = "REML")

The model summary is

# summary
summary(m_orig)

Family: gaussian 
Link function: identity 

Formula:
DailyIncrease ~ s(Precipitation)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.068637   0.005005   13.71   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                   edf Ref.df     F p-value    
s(Precipitation) 8.855  8.993 598.1  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.908   Deviance explained = 90.9%
-REML = -366.58  Scale est. = 0.013678  n = 546

and plot the estimated smooth

# plot
draw(m_orig, residuals = TRUE)

Note that a few large values of Precipitation are driving the wiggles in the function. We really need to spread these out. One way is to square root transform the values (you have 0s so a log transform is out of the question - more on this later). Below I repeat the above steps on square root transformed data:

m_orig_sqrt <- gam(DailyIncrease ~ s(sqrt(Precipitation)),
    data = pre, method = "REML")
summary(m_orig_sqrt)

Family: gaussian 
Link function: identity 

Formula:
DailyIncrease ~ s(sqrt(Precipitation))

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.06864    0.00559   12.28   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                         edf Ref.df     F p-value    
s(sqrt(Precipitation)) 8.287  8.849 474.6  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.885   Deviance explained = 88.7%
-REML = -313.63  Scale est. = 0.01706   n = 546
# plot
draw(m_orig_sqrt, residuals = TRUE)

This is better but not perfect. Note however that the excessive wiggliness in the estimated function has been removed.

Models for both data sets

Here I show three different ways to model both of the data sets in one model.

Start by combining the data into a single data frame

# combine into a single data set with a prepost variable coding for the data set
# I also square root transform precip here to avoid doing it in the model
combined <- pre |>
    bind_rows(post) |>
    mutate(prepost = rep(c("pre", "post"), times = c(nrow(pre), nrow(post))),
        prepost = factor(prepost, levels = c("pre", "post")),
    root_precip = sqrt(Precipitation))

Factor by smooths

First the factor by model. Note that we need to include the by factor as a parametric term to model the group means.

# fit the factor by model
m_f_by <- gam(DailyIncrease ~ prepost + s(root_precip, by = prepost),
    data = combined, method = "REML")

This model has fitted two separate smooths, one per data set, as shown in the model output:

> summary(m_f_by)

Family: gaussian 
Link function: identity 

Formula:
DailyIncrease ~ prepost + s(root_precip, by = prepost)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.063371   0.008359   7.581 7.36e-14 ***
prepostpost 0.052715   0.011808   4.464 8.88e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                             edf Ref.df     F p-value    
s(root_precip):prepostpre  7.847  8.441 224.0  <2e-16 ***
s(root_precip):prepostpost 8.672  8.957 548.6  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.862   Deviance explained = 86.4%
-REML = -190.21  Scale est. = 0.037884  n = 1095

The two lines for the smooth terms are separate tests of \text{H}_0: f(x_i) = 0 \; \forall \; i.

The two fitted smooths are visualised using

draw(m_f_by, grouped_by = TRUE)

As such there is nothing in this model that shows the difference between the two smooths. We can, however, compute this difference because we fitted both smooths in a single model:

difference_smooths(m_f_by, smooth = "s(root_precip)") |>
    draw()

This is showing that the "pre" smooth is lower than the "post" smooth around 4 root precip and at large values of root precip. It's not too difficult to plot this on the original (non-square-root) scale, but I won't show that as this is long enough already.

Ordered factor by smooths

Another option that does include an estimate of the difference is to set the model up using an ordered factor. In this model we fit a reference smooth and then smooth differences for the other levels. First we need to form the ordered factor and set the contrasts on this term to the usual treatment contrasts for easier interpretation.

# next we fit the ordered factor version. Here I have said that `pre` is the
# reference level, which is important
combined <- combined |>
    mutate(prepost_o = ordered(prepost, levels = c("pre", "post")))
# set the contrasts to the usual treatment one
contrasts(combined$prepost_o) <- "contr.treatment"

Then we fit the model

m_ord_by <- gam(DailyIncrease ~ prepost_o +
        s(root_precip) + s(root_precip, by = prepost_o),
    data = combined, method = "REML")
summary(m_ord_by)

Family: gaussian 
Link function: identity 

Formula:
DailyIncrease ~ prepost_o + s(root_precip) + s(root_precip, by = prepost_o)

Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.062932   0.008373   7.516 1.18e-13 ***
prepost_opost 0.052965   0.011832   4.476 8.41e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                               edf Ref.df      F p-value    
s(root_precip)               7.627  8.267 230.61  <2e-16 ***
s(root_precip):prepost_opost 8.318  8.751  59.89  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.861   Deviance explained = 86.3%
-REML = -191.1  Scale est. = 0.038043  n = 1095

and visualise it using

draw(m_ord_by)

The second plot shown is how the smooth of root_precip for the "post" group differs from the reference smooth (for the "pre" group); again we see the tendency for the "post" group to be larger than the pre group.

Constrained factor smooths

Another option is the constrained factor smooth basis "sz". This basis allows us to fit a GAM with a "main effect" smooth and then difference smooths for the levels of a factor, which show how each level differs from the average or "main effect" smooth.

m_sz <- gam(DailyIncrease ~ s(root_precip) +
        s(root_precip, prepost_o, bs = "sz"),
    data = combined, method = "REML")

# model summary
summary(m_sz)

Note that

  1. in this model we do not need the parametric factor term (the group means are included in the sz basis)
  2. we set the special basis using the bs argument
  3. the first smooth in the model is the average or "main effect" smooth
  4. the second smooth represents two smooth differences from this average smooth, one per level of the factor.

The fitted model is plotted using

draw(m_sz)

Here, the second plot looks weird because you only have two levels, so they are mirror images of one another.

Two separate models

If all you want to do is compare smooths, then you can do it visually with gratia::compare_smooths():

m_pre <- gam(DailyIncrease ~ s(root_precip),
    data = combined |> filter(prepost == "pre"), method = "REML")
m_post <- gam(DailyIncrease ~ s(root_precip),
    data = combined |> filter(prepost == "post"), method = "REML")

compare_smooths(m_pre, m_post) |>
    draw()

…but not so fast

The above walk through and your approach in general is ignoring some pretty important features of your data which you need to account for.

Issues

The main issues with your approach are:

  1. you're ignoring the time aspect - these data are collected in time order and are thus not independent. All the intervals around the functions shown in the above figures are simply wrong - too narrow - because they assume that the data are, conditional upon the model, independent.

  2. as you currently have it, your response can't be lower than zero. You are truncating the data at 0 because a negative increase is a decrease. Such data can't be Gaussian, as a Gaussian allows for negative values. You can see the problems this causes in some of the figures, where the estimated smooth goes negative. You would need to fit a model that doesn't admit negative values, such as the Tweedie:

     m_tw_f_by <- gam(DailyIncrease ~ prepost + s(root_precip, by = prepost),
         data = combined, method = "REML", family = tw())
    

    But that also doesn't account for the first problem.
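On the first point, a quick way to see the problem is to simulate AR(1) noise around a smooth trend and check the residual ACF of a GAM that ignores the dependence. This is a toy sketch on simulated data, not your data:

```r
library(mgcv)

set.seed(5)
# toy: AR(1) noise around a smooth trend; the GAM ignores the autocorrelation
n <- 300
x <- seq(0, 1, length.out = n)
e <- as.numeric(arima.sim(list(ar = 0.6), n = n, sd = 0.2))
y <- sin(2 * pi * x) + e
m_toy <- gam(y ~ s(x), method = "REML")

# substantial lag-1 autocorrelation remains in the residuals, so the
# intervals reported by summary() and draw() are too narrow
acf(residuals(m_toy), plot = FALSE)$acf[2]
```

In your case, acf(residuals(preGAM)) would show the same thing if the dependence is there; one fix is an AR(1) working model via bam(..., rho = ...), another is to model time explicitly.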

It seems like you are causing more problems for yourself by processing the data too early. Why not model the discharge itself as a function of time and include the instantaneous effect of precipitation? You might get more specific help if you spell out what you are trying to do in general, rather than tackle this one specific aspect.

But, for example:

combined <- combined |>
    mutate(year = format(Date, "%Y") |> as.numeric(),
        fyear = factor(year),
        doy = format(Date, "%j") |> as.numeric())

m_time <- bam(Discharge ~ prepost +   # treatment effect
        s(fyear, bs = "re") +         # random year effect
        s(doy, by = fyear, k = 100) + # smooth of time within year
        s(root_precip, by = prepost), # smooth effect of precip
    data = combined,
    method = "fREML",
    family = tw(link = "log"),
    discrete = TRUE, nthreads = 3)

You have one observation where 0 discharge was observed, which seems unlikely:

> combined |> filter(!Discharge > 0)                                                               
# A tibble: 1 × 11
  Date       Discharge Precipitation DailyChange DailyIncrease prepost root_precip prepost_o  year
  <date>         <dbl>         <dbl>       <dbl>         <dbl> <fct>         <dbl> <ord>     <dbl>
1 2012-05-28         0            17        5.56             0 pre            4.12 pre        2012
# ℹ 2 more variables: fyear <fct>, doy <dbl>

Hence the Tweedie, which allows 0s.
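As an aside, the reason the Tweedie works for this: with power parameter 1 < p < 2 it is a compound Poisson-gamma distribution, continuous on the positive reals but with positive probability mass at exactly 0. A quick sketch using mgcv's rTweedie():

```r
library(mgcv)

set.seed(42)
# Tweedie with 1 < p < 2: continuous on (0, Inf) plus a point mass at 0,
# unlike the Gaussian (allows negatives) or a plain gamma (no zeros)
y <- rTweedie(mu = rep(1, 10000), p = 1.5, phi = 1)
mean(y == 0)  # a non-trivial proportion of exact zeros
min(y)        # never negative
```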

The estimated smooths are:

You can then estimate quantities of interest from that model, such as: what is the effect of precipitation on discharge in the two settings?
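One way to get at such quantities is predict() with the exclude argument, which zeroes out the random year effect so you get a population-level expected discharge for each period. A self-contained sketch on simulated data with the same structure (your m_time and combined would go in place of m and df):

```r
library(mgcv)

set.seed(2)
# toy stand-in: group effect, group-specific smooths, random "year" intercept
df <- data.frame(g = factor(rep(c("pre", "post"), each = 150)),
    x = runif(300), yr = factor(rep(1:5, 60)))
re <- rnorm(5, sd = 0.2)
df$y <- exp(0.5 + 0.3 * (df$g == "post") + sin(2 * pi * df$x) +
    re[df$yr]) * rgamma(300, shape = 5, rate = 5)
m <- gam(y ~ g + s(x, by = g) + s(yr, bs = "re"),
    data = df, family = tw(link = "log"), method = "REML")

# expected response at chosen covariate values for each group, with the
# random year effect excluded (a population-level prediction)
newd <- expand.grid(x = c(0.25, 0.75), g = c("pre", "post"), yr = "1")
newd$g <- factor(newd$g, levels = levels(df$g))
newd$yr <- factor(newd$yr, levels = levels(df$yr))
newd$fit <- predict(m, newd, type = "response", exclude = "s(yr)")
newd
```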

This is just for illustration; you'd want to check lots of assumptions here, and there may well still be unmodelled autocorrelation for example.

What I have done in this model is estimate the raw discharge values as a function of:

  1. the day of year within year - s(doy, by = fyear, k = 100) to account for the different discharge patterns within each year that you observed. k is set large to allow for the rapid changes and peaks in discharge,
  2. the year of observation - s(fyear, bs = "re"), which is a random intercept for year; with the big gap it seems unreasonable to model these as a single time series,
  3. the effects of precipitation on discharge - s(root_precip, by = prepost) which is allowed to vary between the two periods,
  4. the treatment mean effects - prepost

As this model is getting a bit large, I turned on some tricks to fit it quickly using multiple CPU cores. Note the use of bam() over gam(), discrete = TRUE, method = "fREML", and nthreads = 3.

To test if the slopes of the two precip curves differ, you could compute the derivatives of those two smooths using gratia::derivatives(). This would be the derivative on the log scale however. To get this value on the response scale we'd need to use posterior simulation via fitted_samples() or possibly the new function response_derivatives(). The former would allow you to estimate a difference of derivatives but you'd need to compute the derivatives (using finite differences) yourself for each draw from the posterior. It's not hard to do this, once you know the recipe.
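For reference, that recipe looks roughly like this; a sketch on a small simulated model so it is self-contained (your real model and prediction data would go in place of m and newd):

```r
library(mgcv)
library(MASS)  # mvrnorm for posterior draws

set.seed(1)
# toy stand-in for the real model: positive response, smooth effect of x
df <- data.frame(x = runif(200))
df$y <- exp(1 + sin(2 * pi * df$x)) * rgamma(200, shape = 5, rate = 5)
m <- gam(y ~ s(x), data = df, family = tw(link = "log"), method = "REML")

# 1. draw coefficient vectors from the (approximate) posterior
n_sim <- 1000
betas <- mvrnorm(n_sim, coef(m), vcov(m))

# 2. linear predictor matrices at x and at x + eps
eps <- 1e-4
newd <- data.frame(x = seq(0.05, 0.95, length.out = 50))
X0 <- predict(m, newd, type = "lpmatrix")
X1 <- predict(m, transform(newd, x = x + eps), type = "lpmatrix")

# 3. response-scale fitted values per draw, then finite differences
ilink <- family(m)$linkinv
deriv_draws <- (ilink(X1 %*% t(betas)) - ilink(X0 %*% t(betas))) / eps

# 4. summarise the posterior of the derivative at each x
d_med <- apply(deriv_draws, 1, median)
d_ci  <- apply(deriv_draws, 1, quantile, probs = c(0.025, 0.975))
```

For a difference of derivatives between two groups, you'd build the two lpmatrix pairs (one per group), form the finite differences for each draw, and subtract them draw by draw before summarising.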

The latter, response_derivatives() would compute the derivatives on the response scale. For this you need the development version of gratia:

discharge_d <- response_derivatives(m_time, focal = "root_precip",
    data = ds, n_sim = 10000, seed = 10,
    exclude = sms[c(1:7)])

discharge_d
# A tibble: 200 × 9
    .row .focal      .derivative .lower_ci .upper_ci prepost fyear   doy root_precip
   <int> <chr>             <dbl>     <dbl>     <dbl> <fct>   <fct> <dbl>       <dbl>
 1     1 root_precip     0.0613     0.0179    0.131  pre     2020    213      0     
 2     2 root_precip    -0.0124    -0.0785    0.0467 post    2020    213      0     
 3     3 root_precip     0.0630     0.0186    0.135  pre     2020    213      0.0738
 4     4 root_precip    -0.0118    -0.0752    0.0467 post    2020    213      0.0738
 5     5 root_precip     0.0646     0.0200    0.139  pre     2020    213      0.148 
 6     6 root_precip    -0.0108    -0.0706    0.0460 post    2020    213      0.148 
 7     7 root_precip     0.0660     0.0218    0.142  pre     2020    213      0.221 
 8     8 root_precip    -0.00942   -0.0649    0.0451 post    2020    213      0.221 
 9     9 root_precip     0.0672     0.0239    0.144  pre     2020    213      0.295 
10    10 root_precip    -0.00759   -0.0583    0.0433 post    2020    213      0.295 
# ℹ 190 more rows
# ℹ Use `print(n = ...)` to see more rows

and then plot

discharge_d |>
    mutate(precip = root_precip^2) |>
    ggplot(aes(x = precip, y = .derivative, group = prepost)) +
    geom_ribbon(aes(ymin = .lower_ci, ymax = .upper_ci, fill = prepost),
        alpha = 0.2) +
    geom_line(aes(colour = prepost), linewidth = 2) +
    facet_wrap(~ prepost, scales = "free_y")

which is consistent with the estimated smooths:

Note the derivative is positive in the post setting, except for precip > 40 (root_precip > ~6.3), which could be fixed a little by reducing k below the default to allow less wiggliness.

The above is just for illustration; a proper analysis would need to delve into the modelling more.

You are using the gam package now; previously you were using the mgcv package. These give very different output, beyond just the R^2 and deviance explained no longer being present.

That's not doing what you (or maybe the OP) thinks; you've switched gam() functions to use the one from the gam package.

That parametric coefficient is the linear term in the basis - it isn't representative of the smooth as a whole and arguably shouldn't be pulled out and treated separately (but who am I to argue with the people that invented GAMs). mgcv treats this term as being part of the estimated smooth function (which it is!). I don't quite follow why the gam package separates these effects; I suspect it's because they want to show and test the wiggly part of the estimated smooth, but you can't take the parametric coefficient and use it in any sensible way from a model like this.

As I mentioned in my long reply, the extreme precipitation values are having a disproportionate effect on the estimated smooths. I think you need to transform precipitation during fitting to get better estimates of the smooth functions.

If you'd fitted this as a single model, you could compute the derivatives, and thence a difference of derivatives, directly using posterior simulation. Right now, if you'd fitted separate GAMs using mgcv, you could use my gratia package to compute the derivatives (slopes) of those smooths using the derivatives() function. But to compare them statistically, you need to fit both periods/conditions in the one model.

This isn't a good strategy; you're not going to get the same estimated function for the interval of interest if you fit to the whole data and to a subset of the data. You should fix the issue with the skewed precipitation values (e.g. by taking a square root, as per my other answer) and then compute derivatives on the full function, but only look at the part < 15mm.

I also hope you had the idea to do this test before you looked at the model output? If not (be honest!), then you need to flag this analysis as exploratory when you write it up, as you have double-dipped.

You can, but that isn't going to help the OP here if they have two separate models. The anova() here is doing a (generalized) likelihood ratio test, which is formed by assuming you can go from the more complex model to the simpler model by setting some of the coefficients in the more complex model to 0. It's also predicated on using the same response data, which is also not the case here. As such this test simply isn't going to work.
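For contrast, here is what a valid comparison looks like: both models fitted to the same response, with the simpler model nested in the more complex one, and using method = "ML" since REML likelihoods aren't comparable across different fixed-effect structures. A toy sketch:

```r
library(mgcv)

set.seed(3)
df <- data.frame(g = factor(rep(c("pre", "post"), each = 100)), x = runif(200))
df$y <- sin(2 * pi * df$x) + 0.5 * (df$g == "post") + rnorm(200, sd = 0.3)

# same response in both models; m0 is (approximately) nested within m1
m0 <- gam(y ~ s(x), data = df, method = "ML")
m1 <- gam(y ~ g + s(x, by = g), data = df, method = "ML")
anova(m0, m1, test = "F")  # a generalized likelihood ratio style test
```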
