Hello, I need help identifying whether a predictor variable needs to be a fixed effect, a random effect, or both. I understand a fixed effect to mean "a variable of interest" and a random effect to be something that represents a structural component, like a sample design (good for repeated measures designs and captures similarities between observations). I've also learned that random effects shouldn't be factors with fewer than 5-10 levels. Is there a scenario where a single variable is required to be both a fixed and a random effect?
My design is that I collected fish in nets once per site (47 sites), and I did this once per season (wet and dry; so 2 times per year). My question is: do I need all these random effects (I've added my reasoning for adding them after each one, but maybe some are unnecessary/overly complicated)? Does the sample design force me to keep season and year as random effects in addition to my fixed effects? Can season be a random effect even if it only has 2 levels? Is there an alternative to make things less complicated?
num = counts of fish
CYR.std = standardized calendar year (2007 = 1, 2008 = 2, etc.)
fCYR = factor year
fSeason = factor season (dry,wet)
fSite = 47 sites per season
Model syntax:
mod <- bam(num ~
# Parametric terms (interested in a long-term trend that varies by season)
CYR.std * fSeason +
# Habitat covariates (ignore)
sed_depth * ave_hw +
total_ave_ma +
s(ave_tt) +
# Structural components
s(fSite, bs = "re") +
# No repeated measures per fSite (counts aggregated), BUT a random intercept
# says: each site can have its own starting position in terms of abundance. If not
# random, then we are explicitly saying all sites start with the same abundance.
# It makes more sense for abundance to have a random starting point (in this case).
s(fSite, fCYR, bs = "re") +
# The site:year interaction captures the correlation (repeated measure)
# between the 2 measurements per year, at each site. Captures the broader
# level variation among sites and years.
s(fSite, CYR.std, bs = "re") +
# Each site can have its own trend in abundance, over continuous time
s(fSeason, fCYR, bs = "re") +
# Captures correlation between observations within the same season and year
offset(log(area_sampled)),
data = toad2,
method = 'fREML',
discrete = TRUE,
family = poisson,
control = list(trace = TRUE))
Let's see if I understand the question correctly. I interpret it as one of repeated measures of a count variable Y, a binary season variable X_1, a continuous time variable X_2, and four continuous habitat variables X_3 \dots X_6, taken at 47 different sites. You would like to be in a position to estimate the count, given X_{1\dots6}, at a location 48.
However, the site dynamics give rise to potentially different drivers of X. Two sites with equal sediment depths may, for example, differ in sediment composition depending on stream competence and basin regolith. In this illustration, sediment depth is a fixed effect and site location is a random effect that accounts for differences among sites that are otherwise similar. The form of the model in the {lme4} package is
library(lme4)
model <- lmer(count ~ season + var1 + var2 + var3 + var4 + (1 | subject_location), data = your_data)
for the simple case in which each level of the grouping factor has its own intercept.
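Since the response here is a count with an area offset, a Poisson GLMM may be closer to the original model than a Gaussian `lmer` fit. The following is only a minimal sketch under stated assumptions: the data are simulated, and every name in it (`dat`, `site`, `var1`, `area`, the effect sizes) is hypothetical rather than taken from the real data set.

```r
library(lme4)

set.seed(1)

# Simulated stand-in data (hypothetical): 47 sites x 10 years x 2 seasons
n_sites <- 47
dat <- expand.grid(site   = factor(seq_len(n_sites)),
                   year   = 1:10,
                   season = factor(c("dry", "wet")))
dat$var1 <- rnorm(nrow(dat))             # one made-up habitat covariate
dat$area <- runif(nrow(dat), 50, 150)    # made-up area sampled per net

# Each site gets its own deviation from the overall intercept
site_dev <- rnorm(n_sites, sd = 0.5)
eta <- log(dat$area) - 4 +
  0.3 * (dat$season == "wet") +
  0.2 * dat$var1 +
  site_dev[as.integer(dat$site)]
dat$count <- rpois(nrow(dat), exp(eta))

# Random intercept for site; offset() turns the model into one for density
fit <- glmer(count ~ season + var1 + (1 | site) + offset(log(area)),
             data = dat, family = poisson)

fixef(fit)     # population-level ("typical site") coefficients
VarCorr(fit)   # estimated between-site standard deviation
```

The fixed-effect coefficients describe a site whose random intercept is zero, which is one way to read "a typical site".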
There remain the temporal aspects of your data, which present the possibility of autocorrelation that needs to be addressed: are the observations at a station in the same season of successive years correlated? See Chapter 7 of Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis by F. E. Harrell. The book was published June 5, 2001 by Springer New York, ISBN 0-387-95232-2.
Thank you! Yes, this is a repeated measures design at locations (sites) that are the same each year and season. I would like to estimate counts given habitat/water covariates x1-6 "for a typical site". Now, since this is not a random selection of sites, a "typical" site is restricted to 1 of 47 (which is fine) - the random intercept for site works perfectly here.
My main confusion was about the temporal aspects (year*season), as they are both "variables of interest" (i.e. I want to know the long-term trend of counts, by season), but do I also need a random intercept or slope for year and season (to capture the correlation of observations within the same year and season)? Is a random effect the only way to do this, or do fixed effects do this too (so that I only need EITHER a fixed or a random effect for year and season)? I was told I need both fixed and random effects because they do separate things, but maybe this isn't true. What would a random effect look like in the lme4 syntax?
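As a rough illustration of the lme4 syntax asked about here, the sketch below keeps year as a fixed continuous trend (`CYR.std * fSeason`) while the factor version of year (`fCYR`) appears only inside random-intercept groupings. It reuses the variable names from the model above, but the data frame and all effect sizes are simulated assumptions, not the real `toad2` data.

```r
library(lme4)

set.seed(2)

# Simulated stand-in (hypothetical): 47 sites x 12 years x 2 seasons
dat <- expand.grid(fSite   = factor(seq_len(47)),
                   CYR.std = 1:12,
                   fSeason = factor(c("dry", "wet")))
dat$fCYR <- factor(dat$CYR.std)

site_year <- interaction(dat$fSite, dat$fCYR)     # site-by-year cells
seas_year <- interaction(dat$fSeason, dat$fCYR)   # season-by-year cells

site_dev <- rnorm(47, sd = 0.5)
sy_dev   <- rnorm(nlevels(site_year), sd = 0.3)
ssn_dev  <- rnorm(nlevels(seas_year), sd = 0.2)

eta <- 0.5 + 0.03 * dat$CYR.std + 0.3 * (dat$fSeason == "wet") +
  site_dev[as.integer(dat$fSite)] +
  sy_dev[as.integer(site_year)] +
  ssn_dev[as.integer(seas_year)]
dat$num <- rpois(nrow(dat), exp(eta))

# Fixed part: long-term trend that differs by season.
# Random part: intercepts for site, site-within-year, and season-within-year.
# A per-site trend would be written (0 + CYR.std | fSite); it is not fitted here.
fit <- glmer(num ~ CYR.std * fSeason +
               (1 | fSite) + (1 | fSite:fCYR) + (1 | fSeason:fCYR),
             data = dat, family = poisson)

fixef(fit)     # the trend of interest
VarCorr(fit)   # how much variation each grouping absorbs
```

In this layout the fixed terms estimate the average trend, while the random intercepts only soak up shared deviations around it, so the same underlying variable can appear in both roles without being duplicated.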
I found out that it depends on what syntax is being used! When using s(year, season, bs = 'fs'), no random intercept or separate fixed effect is needed. When using s(year, by = fSeason), one or the other is also required (to represent the mean of each season)!
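The difference between the two syntaxes can be sketched on simulated data. This is only an illustration under assumptions: the data frame, the shapes of the seasonal trends, and the choice of `k = 5` are all made up for the example.

```r
library(mgcv)

set.seed(3)

# Simulated stand-in (hypothetical): 47 sites x 12 years x 2 seasons
dat <- expand.grid(fSite   = factor(seq_len(47)),
                   CYR.std = 1:12,
                   fSeason = factor(c("dry", "wet")))
wet <- dat$fSeason == "wet"
eta <- 0.5 + 0.4 * wet +
  ifelse(wet, sin(dat$CYR.std / 2), 0.05 * dat$CYR.std)
dat$num <- rpois(nrow(dat), exp(eta))

# (a) Factor-smooth basis: each season's curve carries its own penalized
#     intercept, so fSeason is not added as a separate parametric term.
m_fs <- gam(num ~ s(CYR.std, fSeason, bs = "fs", k = 5),
            family = poisson, data = dat, method = "REML")

# (b) by-factor smooths are centred, so the season means have to come
#     from a separate term (here a parametric fSeason effect).
m_by <- gam(num ~ fSeason + s(CYR.std, by = fSeason, k = 5),
            family = poisson, data = dat, method = "REML")

names(coef(m_fs))   # no fSeason coefficient
names(coef(m_by))   # includes "fSeasonwet"
```

One caveat worth keeping in mind: the `"fs"` basis treats the per-level curves like random effects with a shared smoothing parameter, which is estimated from only two levels when the factor is season.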