The best place to start is with this CRAN Task View to get an idea of the range of tools available. From there, I suggest reading the lme4 vignette. Discrete Data Analysis with R or a similar text should also be consulted if well-being is a discrete (categorical) variable. Another resource is the dagR package for directed acyclic graphs, which can be used to analyze potential mediating and confounding effects. See this review of the topic in Judea Pearl.
A preliminary analysis can help in selecting a modeling approach. Consider f(x) = y, where f, x and y are R objects, which may be, and often are, composites of other objects.
For the study, x might be a data.table or data.frame object containing rows representing subjects and columns representing observations of the variables. Let Y be the response variable, well-being. Although well-being possibly varies continuously from barely alive to healthy and happy across a variety of measures (BMI, performance on standardized tests, connectedness and centrality in social network graphs, etc.), Y might be assignment to one of a few ordinal categories. The X (predictor) variables might be school, age cohort, sport and minutes.
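As a minimal sketch of what such an x could look like, the block below builds a simulated data.frame with one row per subject. Every column name, factor level and value here is a hypothetical placeholder, not taken from any real study.

```r
# Simulated stand-in for the study data: one row per subject.
# All names, levels and values are hypothetical.
set.seed(1)
n <- 200
our_data <- data.frame(
  school  = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  cohort  = factor(sample(c("younger", "middle", "older"), n, replace = TRUE)),
  sport   = factor(sample(c("soccer", "swimming", "none"), n, replace = TRUE)),
  minutes = round(runif(n, 0, 300))
)

# Well-being stored as an ordered factor rather than a continuous score
wb_levels <- c("poor", "subpar", "neutral", "healthy", "excellent")
our_data$well_being <- factor(
  sample(wb_levels, n, replace = TRUE),
  levels = wb_levels, ordered = TRUE
)

str(our_data)
```

Storing the response as an ordered factor keeps the category order available to any modeling function that can use it.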
y, the return value of f, is some measure of the association of X and Y in x. That is, given observed X, what can we infer about unobserved Y, and what is the relative contribution of X_1, ..., X_n to what we can infer? (Or the interest may be in the change in Y, from start to finish or from week to week.)
What choice of a function object f is appropriate to the need? That is, what model is appropriate? Taking the naive approach, we might think that a linear combination of the terms of X would tell us something. For simplicity, assume Y is binary rather than multi-category, and let's apply linear regression using one continuous variable, X1:
fit <- lm(Y ~ X1, data = our_data)
We would get a model, but the diagnostics would prove problematic because Y is binary: it can take only one of two values. This can be seen clearly in a simple scatterplot.
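To see the problem concretely, here is a small sketch on simulated data. The data-generating values are assumptions chosen only for illustration. The scatterplot shows two horizontal bands of points, and the straight-line fit is not constrained to the interval from 0 to 1.

```r
# Simulated binary response; the coefficients are arbitrary illustration values
set.seed(42)
n  <- 200
X1 <- runif(n, 0, 300)                 # hypothetical continuous predictor
p  <- plogis(-6 + 0.04 * X1)           # true probability that Y = 1
our_data <- data.frame(X1 = X1, Y = rbinom(n, 1, p))

fit_lm <- lm(Y ~ X1, data = our_data)

# Points sit on two bands (Y = 0 and Y = 1); the line cuts across them
plot(Y ~ X1, data = our_data)
abline(fit_lm)

# Fitted values from a straight line are not bounded by 0 and 1
range(fitted(fit_lm))

# Residuals-vs-fitted plot shows the two-band pattern as well
plot(fit_lm, which = 1)
```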
In other words, ordinary least squares regression is an inappropriate tool for a problem involving a binary response variable. For that we want an f in the form of
fit <- glm(Y ~ X1, family = binomial, data = our_data)
and after some further work, we get a profile log-likelihood result.
(Examples taken from here.)
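A self-contained sketch of the logistic fit on simulated data follows. The data are made up, and the call to confint() is only one way to get profile-likelihood output, not necessarily the exact steps used in the linked examples.

```r
# Simulated binary response; the coefficients are arbitrary illustration values
set.seed(42)
n  <- 200
X1 <- runif(n, 0, 300)
our_data <- data.frame(X1 = X1, Y = rbinom(n, 1, plogis(-6 + 0.04 * X1)))

# family = binomial requests logistic regression; without it, glm() defaults
# to a Gaussian family and reproduces the lm() fit
fit <- glm(Y ~ X1, family = binomial, data = our_data)
summary(fit)

# Fitted probabilities now stay between 0 and 1
range(fitted(fit))

# For glm objects, confint() computes profile-likelihood intervals
# (the method lives in stats in recent R versions, in MASS in older ones)
confint(fit)
```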
However, unless the well-being represented by Y is simply good or poor, the usual logistic regression model does not fit either, because there may be poor, subpar, neutral, healthy and excellent, to name some. For that case there is ordinal logistic regression. It comes in two flavors: proportional odds and forward continuation ratio. See Harrell ch. 13.
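For the proportional odds flavor, one option is polr() from the MASS package. This is a sketch on simulated data; the cut points, coefficient and variable names are assumptions made for illustration, and the continuation ratio model is not covered here.

```r
library(MASS)  # provides polr(); MASS is a recommended package shipped with R

# Simulate an ordinal response by cutting a latent continuous score
set.seed(7)
n  <- 300
X1 <- runif(n, 0, 5)                    # hypothetical predictor, e.g. hours
latent <- 0.9 * X1 + rlogis(n)          # arbitrary illustration values
wb_levels <- c("poor", "subpar", "neutral", "healthy", "excellent")
our_data <- data.frame(
  X1 = X1,
  Y  = cut(latent, breaks = c(-Inf, 0.5, 1.5, 3, 4.5, Inf),
           labels = wb_levels, ordered_result = TRUE)
)

# Proportional odds logistic regression; Hess = TRUE keeps the Hessian
# so that summary() can report standard errors
fit_po <- polr(Y ~ X1, data = our_data, Hess = TRUE)
summary(fit_po)

# Odds ratio of being in a higher category per one-unit increase in X1
exp(coef(fit_po))
```

The model estimates a single slope for X1 plus one intercept per boundary between adjacent categories, which is what "proportional odds" refers to.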
There are two further complications.
The ordinality assumption must be satisfied—Y behaves in an ordinal fashion with respect to each X.
The time series nature of the data, which may make the value of Y at t_k dependent upon the value at t_j. (See the case study by Harrell at §14.3.)
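For the first complication, one informal check is to look at the mean of each X within the levels of Y; if Y behaves ordinally with respect to X, those means should fall in a consistent order. The sketch below does this in base R on simulated data (all values are assumptions for illustration) and does not address the time-dependence issue.

```r
# Simulated ordinal response, built the same way as a latent score cut into levels
set.seed(11)
n  <- 300
X1 <- runif(n, 0, 5)                    # hypothetical predictor
latent <- 0.9 * X1 + rlogis(n)          # arbitrary illustration values
wb_levels <- c("poor", "subpar", "neutral", "healthy", "excellent")
Y <- cut(latent, breaks = c(-Inf, 0.5, 1.5, 3, 4.5, Inf),
         labels = wb_levels, ordered_result = TRUE)

# Mean of X1 within each level of Y, in the order of the levels
x_means <- tapply(X1, Y, mean)
x_means

# Plot the means against the ordered categories to inspect the trend
plot(seq_along(x_means), x_means, type = "b", xaxt = "n",
     xlab = "Y", ylab = "mean of X1")
axis(1, at = seq_along(x_means), labels = names(x_means))
```

A clearly non-monotone pattern for some X would suggest that the ordinal treatment of Y is questionable for that predictor.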
Oftentimes, "what" is more of a challenge than "how."