# Find the optimal model using stepwise linear regression

I have an `sf` object with two columns, named `fclass` and `geometry`. I am interested in the `fclass` column, which has several unique values. I convert this `sf` object to a `SpatRaster`, resample it, and perform a linear regression using another `SpatRaster`, called `ntl`, as my response (`lm(ntl ~ road)`).

My goal is to find which unique values give the largest R² in the `lm` model using stepwise linear regression with backward elimination. My initial try was:

```
library(terra) # rast, vect, rasterizeGeom, resample
library(sf)

ntl <- rast("path/ntl.tif") # response

ref <- rast("path/pop.tif") # get ext and pixel size

# baseline r2 using all classes
vterra <- vect(v)
r <- rast(vterra, res = res(ref), ext = ext(ref))
x <- rasterizeGeom(vterra, r, "length", "m")
x_res <- resample(x, ntl, "average")

s <- c(ntl, x_res)
names(s) <- c("ntl", "road")

m <- lm(ntl ~ road, data = as.data.frame(s, na.rm = TRUE))
baseline <- summary(m)$r.squared
orig_baseline <- baseline

classes <- unique(v$fclass)

# backward elimination: drop a class whenever removing it improves r2
i <- 1
while (i <= length(classes)) {
  class <- classes[i]
  print(paste0("current class ", class))
  print(paste0("orig baseline: ", orig_baseline, " - baseline: ", baseline))
  print(classes)

  v_filtered <- v[v$fclass != class, ]
  vterra <- vect(v_filtered)
  x <- rasterizeGeom(vterra, r, "length", "m")
  x_res <- resample(x, ntl, "average")

  s <- c(ntl, x_res)
  names(s) <- c("ntl", "road")

  m <- lm(ntl ~ road, data = as.data.frame(s, na.rm = TRUE))
  class_r2 <- summary(m)$r.squared

  if (class_r2 > baseline) {
    v <- v_filtered        # keep the removal, so later iterations build on it
    classes <- classes[-i]
    baseline <- class_r2
  } else {
    print("class_r2 is less than baseline")
    print(paste0("class_r2: ", class_r2, " - baseline: ", baseline))
    i <- i + 1
  }
}
```

but it's not efficient (it takes ~ 5 minutes).
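Most of those 5 minutes are likely spent re-rasterizing inside the loop. A sketch of a faster layout (not tested on your data; the names `v`, `ntl`, and `ref` come from the code above, everything else is an assumption): rasterize every class once, up front, into a multi-layer `SpatRaster`; the road layer for any candidate subset of classes is then just a sum of cached layers, so no `rasterizeGeom()`/`resample()` calls happen inside the elimination loop.

```r
library(terra)
library(sf)

# one road-length layer per class, computed once (the expensive part)
classes <- unique(v$fclass)
r <- rast(vect(v), res = res(ref), ext = ext(ref))
layers <- lapply(classes, function(cl) {
  x <- rasterizeGeom(vect(v[v$fclass == cl, ]), r, "length", "m")
  resample(x, ntl, "average")
})
stk <- rast(layers)
names(stk) <- classes

# r2 for any subset of classes is now cheap: sum the cached layers
r2_for <- function(keep) {
  road <- sum(stk[[keep]])
  df <- as.data.frame(c(ntl, road), na.rm = TRUE)
  names(df) <- c("ntl", "road")
  summary(lm(ntl ~ road, data = df))$r.squared
}
```

The backward-elimination loop would then call `r2_for()` with one class dropped at a time instead of rebuilding the rasters.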

I have found the `MASS` package, which has the `stepAIC` function, but I am not sure how to select the unique values from the `fclass` column to use as predictors.
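One way to feed `MASS::stepAIC()` is to give each class its own predictor column. A sketch, assuming you have rasterized each class into its own layer of a multi-layer `SpatRaster` called `stk` (one layer per `fclass` value, layer names set to the class names; `stk` is an assumption, not from the original code):

```r
library(MASS)  # stepAIC
library(terra)

# one column per class plus the response, pixels as rows
df <- as.data.frame(c(ntl, stk), na.rm = TRUE)
names(df)[1] <- "ntl"
names(df) <- make.names(names(df))  # ensure syntactically valid column names

# full model with every class as a predictor, then backward elimination by AIC
full <- lm(ntl ~ ., data = df)
best <- stepAIC(full, direction = "backward", trace = FALSE)
summary(best)$r.squared
```

Note this selects classes by AIC rather than by raw R², which is usually what you want anyway, since R² alone always favors keeping every predictor.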

```
head(v, 6)
Simple feature collection with 6 features and 1 field
Geometry type: MULTILINESTRING
Dimension:     XY
Bounding box:  xmin: 598675.9 ymin: 7111459 xmax: 609432.8 ymax: 7118729
Projected CRS: WGS 84 / UTM zone 35S
fclass                       geometry
1     secondary MULTILINESTRING ((598675.9 ...
2     secondary MULTILINESTRING ((600641.7 ...
3   residential MULTILINESTRING ((601734.8 ...
4   residential MULTILINESTRING ((601163.9 ...
5   residential MULTILINESTRING ((601104.2 ...
```

The unique values in the `fclass` column:

```
unique(v$fclass)
[1] "secondary"      "residential"    "motorway_link"  "service"        "primary"        "unclassified"   "motorway"
```

If necessary, I can share the data via my Google Drive.

I have two comments on this:

• Be wary of automated variable selection; it is a method known to produce nonsensical models, especially when used on datasets with (relatively) many columns and (relatively) few observations. Been there, tried that, got burnt, have the T-shirt.
• If you absolutely, positively need to try it out (if for no other reason than to see for yourself the error of your ways - wink wink, not meant personally), consider the {leaps} package.

{leaps} has a nice function `regsubsets` that produces a summary which can be queried via `which.min()` for BIC, Mallows' Cp, and the like, or `which.max()` for (adjusted) R² and related metrics.
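A minimal sketch of that workflow, using synthetic data in place of the per-class road rasters (the column names below are made up for illustration):

```r
library(leaps)

set.seed(1)
# fake pixel-level data: one column per road class, plus a response
df <- data.frame(
  secondary   = runif(200),
  residential = runif(200),
  primary     = runif(200),
  motorway    = runif(200)
)
df$ntl <- 2 * df$secondary + df$primary + rnorm(200, sd = 0.1)

# exhaustive search over all subsets of predictors
fit <- regsubsets(ntl ~ ., data = df, nvmax = ncol(df) - 1)
sm  <- summary(fit)

sm$which[which.max(sm$adjr2), ]  # predictors in the best model by adjusted R2
sm$which[which.min(sm$bic), ]    # predictors in the best model by BIC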

To make the regression run faster, you may want to strip your data of the geometry column with an `st_drop_geometry()` call.
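For example, assuming `v` is the `sf` object from the question:

```r
library(sf)

v_df <- st_drop_geometry(v)  # plain data.frame without the sticky geometry column
```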
