I am a classical statistician now trying to educate myself on deep learning, initially for regression problems, using Deep Learning with R, 2nd edition, as my textbook. I have read the chapter on the Boston Housing problem and am now working on the full Ames Housing data set (not the Kaggle version). Following the book’s recommendation, I first developed a baseline classical statistical model to serve as a minimum benchmark that I will then try to beat with deep learning. Specifically, I completed the data for all 82 variables and all 2,930 house-sale records and used Breheny’s grpreg (a glmnet-like variable-selection package) to select the 140 individual model-matrix columns that optimize the cross-validated prediction of log(SalePrice). This gives me a median absolute deviation (mad) of $8,241 from the actual SalePrice and a mad-to-true-value error ratio of 0.05, which seems to me to be a pretty good model. For context, the baseline was fit roughly along the following lines (a sketch only: the group vector, the penalty choice, and the object names are placeholders rather than my exact code):
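library(grpreg)
# Placeholder objects: X is the full model matrix (factor levels expanded to
# dummy columns), grp maps each column to its parent variable, and
# ames$SalePrice is the response. penalty = "grLasso" is just one option.
cvfit <- cv.grpreg(X, log(ames$SalePrice), group = grp, penalty = "grLasso")
pred  <- exp(predict(cvfit, X = X))   # predictions at lambda.min, back-transformed to dollars

I am now trying to see whether I can better this result using deep learning. I started with the following pretty simple model: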
library(keras)

build_model <- function() {
  model <- keras_model_sequential() %>%
    layer_dense(150, activation = "relu") %>%
    layer_dense(150, activation = "relu") %>%
    layer_dense(1)                  # single linear unit predicting log(SalePrice)
  model %>% compile(optimizer = "rmsprop", loss = "mse", metrics = "mae")
  model
}

set.seed(12345)   # seeds R's RNG; TensorFlow's weight initialization has its own seed
model <- build_model()
history <- model %>% fit(
  Xtrain2, logytrain2,
  epochs = 100, batch_size = 32, verbose = 0,
  validation_data = list(Xtest2, logytest2)
)
This gives me a result not quite as good as the classical regression (a mad SalePrice error of $12,950 and a mad error ratio of 0.06).
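For reference, I compute those error figures roughly as follows (a sketch; ytest2 here is just the dollar-scale counterpart of logytest2):

pred_sale <- exp(as.vector(predict(model, Xtest2)))   # back-transform log predictions to dollars
ytest2    <- exp(logytest2)
median(abs(pred_sale - ytest2))                       # mad of the SalePrice error, in dollars
median(abs(pred_sale - ytest2) / ytest2)              # mad-to-true-value error ratio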
Chollet’s book, though excellent, gives me little guidance on how to improve the results of this regression problem. Using fewer units degrades performance, but increasing the number doesn’t improve it. A larger batch size speeds up training (and gives less stable mad estimates from epoch to epoch) but does not change the final estimated mad. Adding further dense layers does not meaningfully change the final result either. I also tried leaky ReLU activations with four hidden dense layers, thinking that my models might have accumulated many dead neurons. That was marginally better than the base model above (a mad of $8,343 from the actual SalePrice and a mad-to-true-value error ratio of 0.0535), but still not as good as my grpreg estimates. Unlike with the other models in the book, with this data set there is no evidence of over-fitting even at 500 epochs, and a stable, near-optimal mad is reached at about the 50th–60th epoch. So far as I can see, the other types of layers are designed primarily for text, image, or classification tasks rather than for simple regression.
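For concreteness, the leaky-ReLU variant looked roughly like this (a sketch; the exact layer widths and the alpha slope shown here are illustrative rather than the precise values I used):

build_model_lrelu <- function() {
  model <- keras_model_sequential() %>%
    layer_dense(150) %>% layer_activation_leaky_relu(alpha = 0.1) %>%
    layer_dense(150) %>% layer_activation_leaky_relu(alpha = 0.1) %>%
    layer_dense(150) %>% layer_activation_leaky_relu(alpha = 0.1) %>%
    layer_dense(150) %>% layer_activation_leaky_relu(alpha = 0.1) %>%
    layer_dense(1)                      # linear output for log(SalePrice)
  model %>% compile(optimizer = "rmsprop", loss = "mse", metrics = "mae")
  model
}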
Since I am working pretty much in a vacuum, with only the book and the web as a guide, I am looking for suggestions on how I can further improve the above model. Any specific improvements, or pointers to a forum where I can get some guidance, would be deeply appreciated. Thanks in advance for any suggestions.
I am also interested in others’ experience of how much a deep learning approach is likely to improve on classical statistical solutions to simple regression problems. Some of my reading has suggested skepticism that deep learning will help much with simple regression. I’d welcome some discussion of that issue. Again, thanks.
Larry Hunsicker