I am a classical statistician now trying to educate myself on deep learning, initially for regression problems, using Deep Learning with R, 2nd edition, as my textbook. I have read the chapter on the Boston Housing problem and am now working on the full Ames Housing data set (not the Kaggle version). Following the book’s recommendation, I first developed a baseline classical statistical model to serve as a minimum benchmark that I will then try to beat with deep learning. Specifically, I completed the data for all 82 variables and all 2,930 house-sale records and used Breheny’s grpreg (a glmnet-like variable-selection package) to select the 140 individual model-matrix columns that optimize the cross-validated prediction of log(SalePrice). This gives me a median absolute deviation (mad) of $8,241 from the actual SalePrice and a mad-to-true-value error ratio of 0.05, which seems to me to be a pretty good model. For context, the baseline was fit roughly along the following lines (a sketch only: the group vector, the penalty choice, and the object names are placeholders rather than my exact code):
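library(grpreg)
# Placeholder objects: X is the full model matrix (factor levels expanded to
# dummy columns), grp maps each column to its parent variable, and
# ames$SalePrice is the response. penalty = "grLasso" is just one option.
cvfit <- cv.grpreg(X, log(ames$SalePrice), group = grp, penalty = "grLasso")
pred  <- exp(predict(cvfit, X = X))   # predictions at lambda.min, back-transformed to dollars

I am now trying to see whether I can better this result using deep learning. I started with the following pretty simple model: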
library(keras)

build_model <- function() {
  model <- keras_model_sequential() %>%
    layer_dense(150, activation = "relu") %>%
    layer_dense(150, activation = "relu") %>%
    layer_dense(1)                  # single linear unit predicting log(SalePrice)
  model %>% compile(optimizer = "rmsprop", loss = "mse", metrics = "mae")
  model
}

set.seed(12345)   # seeds R's RNG; TensorFlow's weight initialization has its own seed
model <- build_model()
history <- model %>% fit(
  Xtrain2, logytrain2,
  epochs = 100, batch_size = 32, verbose = 0,
  validation_data = list(Xtest2, logytest2)
)
This gives me a result not quite as good as the classical regression (a mad SalePrice error of $12,950 and a mad error ratio of 0.06).
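For reference, I compute those error figures roughly as follows (a sketch; ytest2 here is just the dollar-scale counterpart of logytest2):

pred_sale <- exp(as.vector(predict(model, Xtest2)))   # back-transform log predictions to dollars
ytest2    <- exp(logytest2)
median(abs(pred_sale - ytest2))                       # mad of the SalePrice error, in dollars
median(abs(pred_sale - ytest2) / ytest2)              # mad-to-true-value error ratio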
Chollet’s book, though excellent, gives me little guidance on how to improve the results of this regression problem. Using fewer units degrades performance, but increasing the number doesn’t improve it. A larger batch size speeds up training (and gives less stable mad estimates from epoch to epoch) but does not change the final estimated mad. Adding further dense layers does not meaningfully change the final result either. I also tried leaky ReLU activations with four hidden dense layers, thinking that my models might have accumulated many dead neurons. That was marginally better than the base model above (a mad of $8,343 from the actual SalePrice and a mad-to-true-value error ratio of 0.0535), but still not as good as my grpreg estimates. Unlike with the other models in the book, with this data set there is no evidence of over-fitting even at 500 epochs, and a stable, near-optimal mad is reached at about the 50th–60th epoch. So far as I can see, the other types of layers are designed primarily for text, image, or classification tasks rather than for simple regression.
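For concreteness, the leaky-ReLU variant looked roughly like this (a sketch; the exact layer widths and the alpha slope shown here are illustrative rather than the precise values I used):

build_model_lrelu <- function() {
  model <- keras_model_sequential() %>%
    layer_dense(150) %>% layer_activation_leaky_relu(alpha = 0.1) %>%
    layer_dense(150) %>% layer_activation_leaky_relu(alpha = 0.1) %>%
    layer_dense(150) %>% layer_activation_leaky_relu(alpha = 0.1) %>%
    layer_dense(150) %>% layer_activation_leaky_relu(alpha = 0.1) %>%
    layer_dense(1)                      # linear output for log(SalePrice)
  model %>% compile(optimizer = "rmsprop", loss = "mse", metrics = "mae")
  model
}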
Since I am working pretty much in a vacuum, with only the book and the web as a guide, I am looking for suggestions on how I can further improve the above model. Any specific improvements, or pointers to a forum where I can get some guidance, would be deeply appreciated. Thanks in advance for any suggestions.
I am also interested in others’ experience of how much a deep learning approach is likely to improve on classical statistical solutions to simple regression problems. Some of my reading has suggested skepticism that deep learning will help much with simple regression. I’d welcome some discussion of that issue. Again, thanks.
Larry Hunsicker