I'm reading the excellent book Introduction to Feature Engineering... (Kuhn & Johnson) and am confused by the Ames housing example that shows how to use two-stage modeling when building models with interactions from a large number of base variables.

High level, I understand the idea as:

- Identify which *base predictors are important* (e.g. by identifying predictors remaining after using lasso regression)
- Create all pairwise interactions from selected variables
- Input base predictors and interactions into another model that will again do variable selection (e.g. lasso) to get final model

What I am confused by is how the modeling of the error in the 2nd model-building stage (as described at the top of the example) fits into the `ames` modeling example. Was this just an explanatory note? It seems that in both step 1 and step 3 the models would simply be built with a target of `Sale_Price`, rather than `Sale_Price` in the former and `Error` in the latter, correct?

E.g. for the `ames` data, say we are predicting `Sale_Price` and have 6 initial variables: `Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + Lot_Area + Year_Sold`

*Step 1:*

Build a lasso model with the 6 base variables as input. Say lasso selects 3 variables: `Year_Built`, `Year_Sold`, and `Lot_Area`.

*Step 2:*

Create interactions among the selected variables (based on the strong heredity principle): `Year_Built*Year_Sold`, `Year_Sold*Lot_Area`, `Year_Built*Lot_Area`
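For reference, this is how I'm enumerating the step 2 terms. It is only a small Python sketch (the book's code is R), and the helper name `pairwise_interactions` is my own, not something from the book. It simply forms every pair among the main effects that survived step 1, which is my reading of strong heredity here.

```python
from itertools import combinations


def pairwise_interactions(selected):
    """Return every pairwise interaction among the selected main effects."""
    return [f"{a}*{b}" for a, b in combinations(selected, 2)]


# Main effects that lasso kept in step 1 (hypothetical selection from my example)
selected = ["Year_Built", "Year_Sold", "Lot_Area"]
interactions = pairwise_interactions(selected)
print(interactions)
```

With 3 selected variables this gives 3 interaction terms; with k selected variables it would give k(k-1)/2.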

*Step 3 (HERE's WHERE I'M CONFUSED):*

Build a new lasso model using selected main effects and corresponding interactions with a target of `Sale_Price`? I.e. the model:

`Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + Gr_Liv_Area + Lot_Area + Year_Sold + Year_Built*Year_Sold + Year_Sold*Lot_Area + Year_Built*Lot_Area`

(This seems to be what is done in the [code](https://github.com/topepo/FES/blob/master/07_Detecting_Interaction_Effects/7_04_The_Brute-Force_Approach_to_Identifying_Predictive_Interactions/ames_glmnet.R).)

OR is there a step where the resulting interaction terms are modeled against the error, per the opening example of this section and the comment at the bottom about using modeling error in the classification context? E.g.

`error_mod1 = Sale_Price - pred_mod1`

Build lasso regression for:

`error_mod1 ~ Year_Built*Year_Sold + Year_Sold*Lot_Area + Year_Built*Lot_Area`

... and then follow-up step...?

My guess is that the former is correct, but I wanted to make sure.
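To make the two readings of step 3 unambiguous, here is a toy sketch in plain Python (the book's code is R and uses glmnet; none of this is taken from it). Everything in it is an assumption for illustration: the ±1-coded predictors `x1`, `x2`, `x3`, the made-up target `y` standing in for `Sale_Price`, the penalty `lam = 0.5`, and the tiny coordinate-descent `lasso` helper, which has no intercept and assumes centered data.

```python
from itertools import combinations, product


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; anything within [-lam, lam] becomes 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso(columns, target, lam, sweeps=50):
    """Toy coordinate-descent lasso, no intercept (assumes centered data).

    Minimizes (1 / (2n)) * ||target - X b||^2 + lam * ||b||_1.
    columns maps a term name to its values; returns name -> coefficient.
    Runs a fixed number of sweeps rather than checking convergence.
    """
    n = len(target)
    coefs = {name: 0.0 for name in columns}
    residual = list(target)
    for _ in range(sweeps):
        for name, col in columns.items():
            scale = sum(v * v for v in col) / n
            # Inner product of the column with the residual after adding
            # this term's current contribution back in
            rho = sum(v * (r + v * coefs[name]) for v, r in zip(col, residual)) / n
            new = soft_threshold(rho, lam) / scale
            residual = [r - v * (new - coefs[name]) for v, r in zip(col, residual)]
            coefs[name] = new
    return coefs


# Made-up data: a 2^3 factorial design, so all terms are orthogonal
rows = list(product([1.0, -1.0], repeat=3))
main = {f"x{j + 1}": [row[j] for row in rows] for j in range(3)}
y = [3 * a + 2 * b + 1.5 * a * b for a, b, _ in rows]  # x3 is irrelevant
lam = 0.5

# Step 1: lasso on the main effects only
coefs1 = lasso(main, y, lam)
selected = [name for name, c in coefs1.items() if c != 0.0]

# Step 2: pairwise interactions among the selected main effects
inter = {
    f"{a}*{b}": [u * v for u, v in zip(main[a], main[b])]
    for a, b in combinations(selected, 2)
}

# Step 3, option A: refit against the original target
design_a = {**{name: main[name] for name in selected}, **inter}
coefs_a = lasso(design_a, y, lam)

# Step 3, option B: fit the interactions against the step 1 error
pred_mod1 = [sum(coefs1[name] * main[name][i] for name in main) for i in range(len(y))]
error_mod1 = [obs - pred for obs, pred in zip(y, pred_mod1)]
coefs_b = lasso(inter, error_mod1, lam)

print("selected:", selected)
print("option A:", coefs_a)
print("option B:", coefs_b)
```

On this toy data both options give the interaction the same coefficient, but that is an artifact of the orthogonal design; with correlated predictors like the real Ames variables the two need not agree, which is why I'd like to know which one the book intends.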

P.S. Let me know if the GitHub page for the book (or another location) would be a better place to post this question.