Find the optimal model using step wise linear regression

I have a sf object with 2 columns, named fclass and geometry. I am interested in the fclass column. The fclass has several unique values. I convert this sf object to spatraster, I resample it and I perform a linear regression where I use another spatraster, called ntl, as my response (lm(ntl ~ road)).

My goal is to find which unique values give the largest r-squared in the lm model using stepwise linear regression whith backward elimination. My initial try was:

library(pacman)
pacman::p_load(terra, sf, dplyr)

ntl <- rast("path/ntl.tif") # response

v <- st_read("path/road17.shp") # sf object

# baseline r2
vterra <- vect(v)

ref <- rast("path/pop.tif") # get ext and pixel size

r <- rast(v, res = res(ref), ext = ext(ref))

x <- rasterizeGeom(vterra, r, "length", "m")

x_res <- resample(x, ntl, "average")

s <- c(ntl, x_res)
names(s) <- c("ntl", "road")

m <- lm(ntl ~ road, s, na.action = na.exclude)
baseline <- sqrt(summary(m)$adj.r.squared)
orig_baseline <- sqrt(summary(m)$adj.r.squared)

classes <- unique(v$fclass)
inclasses <- unique(v$fclass)

i <- 1
while (i <= length(classes)) {
  class <- classes[i]
  print(paste0("current class ", class))
  print(paste0("orig baseline: ", orig_baseline, " - baseline: ", baseline))
  print(classes)
  
  v_filtered <- v[v$fclass != class, ]
  vterra <- vect(v_filtered)
  r <- rast(v, res = res(ref), ext = ext(ref))
  x <- rasterizeGeom(vterra, r, "length", "m")
  
  x_res <- resample(x, ntl, "average")
  
  s <- c(ntl, x_res)
  names(s) <- c("ntl", "road")
  
  m <- lm(ntl ~ road, s, na.action = na.exclude)
  class_r2 <- sqrt(summary(m)$adj.r.squared)
  
  if (class_r2 > baseline) {
    classes <- classes[-i]
    baseline <- class_r2
  } else {
    print("class_r2 is less than baseline")
    print(paste0("class_r2: ", class_r2, " - baseline: ", baseline))
    i <- i + 1
  }
}

but it's not efficient (it takes ~ 5 minutes).

I have found the MASS package which has the lmStepAIC function but I am not sure how to select the unique values from the fclass column to use as predictiors.

head(v, 6)
Simple feature collection with 6 features and 1 field
Geometry type: MULTILINESTRING
Dimension:     XY
Bounding box:  xmin: 598675.9 ymin: 7111459 xmax: 609432.8 ymax: 7118729
Projected CRS: WGS 84 / UTM zone 35S
         fclass                       geometry
1     secondary MULTILINESTRING ((598675.9 ...
2     secondary MULTILINESTRING ((600641.7 ...
3   residential MULTILINESTRING ((601734.8 ...
4   residential MULTILINESTRING ((601163.9 ...
5   residential MULTILINESTRING ((601104.2 ...
6 motorway_link MULTILINESTRING ((609432.8 ...

The unique values in the fclass column:

unique(v$fclass)
 [1] "secondary"      "residential"    "motorway_link"  "service"        "primary"        "unclassified"   "motorway"      
 [8] "tertiary"       "trunk"          "primary_link"   "footway"        "track"          "secondary_link" "unknown"       
[15] "living_street"  "pedestrian"     "path"           "bridleway"      "steps"          "trunk_link"     "track_grade1"  
[22] "track_grade3"   "track_grade5"   "cycleway"       "track_grade4"   "tertiary_lin

If necessary I can share the data from my GoogleDrive .

I have two comments on this:

  • be wary of automated variable selection; it is a method known to present nonsensic models. Especially when used for datasets with (relatively) many columns and (relatively) few observations. Been there, tried that, got burnt, have the T-shirt..
  • if you absolutely positively need to try it out (if for no other reason than to see for yourself the error of your ways - wink wink, not meaning this personally) consider the {leaps} package.

{leaps} has a nice function regsubsets that produces a summary which can be queried via which.min() for AIC and the like or which.max() for (adjusted) R² and related metrics.

To make the regression run faster you may want to strip your data of the geometry column via st_drop_geometry() call.