I guess this is very similar to the question that was highlighted a few weeks ago, and I should caveat my question with the fact that I am far from an expert.

I am reading Applied Predictive Modeling by Dr Max Kuhn and Dr Kjell Johnson, and I am currently on the chapter about feature selection. One of the points the book highlights is that feature selection evaluates predictors in isolation. What I mean by this is that features which have a low correlation with the outcome you are trying to predict, or a high correlation with each other, might be dropped because they are considered poor predictors on their own. However, there might be interactions in the data set that the researcher is not aware of, where predictors in combination could make excellent predictors. Theory should of course inform this, and should be the primary driver, especially during the data-gathering stage, but I suppose my questions are:

If someone has a very large data set with many predictors, is there a modelling method that could point the researcher towards specific interactions worth investigating?

Further to this, given the interactions that this as-yet-unnamed modelling technique has highlighted, and assuming the researcher has split the data into training and test sets, is it then possible to use these findings (the resulting model) in an inference capacity on the test set?

Could someone direct me to some useful material on the subject (books, articles, blog posts)?
I'm open to the very real possibility that this is a novice question and might be complete nonsense.

If you start with a design matrix that has the pool of interactions you want to evaluate (along with the main effects), you could use a regularization model (such as glmnet) to see which seem important to the model. You might also feed them into a genetic algorithm (like caret::gafs) to see what gets selected, although this approach is much more computationally expensive than glmnet.
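A minimal sketch of the glmnet idea, assuming the glmnet package is installed; the data and variable names here are made up for illustration:

```r
library(glmnet)

set.seed(1)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 2 * df$x1 * df$x2 + rnorm(n, sd = 0.5)   # true x1:x2 interaction

# model.matrix() with (...)^2 expands all main effects plus pairwise
# interactions; drop the intercept column before handing it to glmnet
X <- model.matrix(y ~ (x1 + x2 + x3)^2, data = df)[, -1]

fit   <- cv.glmnet(X, df$y, alpha = 1)        # alpha = 1 is the lasso
coefs <- coef(fit, s = "lambda.1se")
sel   <- rownames(coefs)[as.vector(coefs) != 0]
sel   # interaction columns show up with names like "x1:x2"
```

The lasso's sparsity is what does the screening here: terms whose coefficients survive the penalty are the candidates worth a closer look.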

A second-degree MARS model will selectively add interactions when the model thinks they are needed.
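A small sketch of a second-degree MARS fit via the earth package (assumed installed); the simulated data are purely illustrative:

```r
library(earth)

set.seed(1)
n  <- 300
df <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
df$y <- df$x1 * df$x2 + rnorm(n, sd = 0.1)

# degree = 2 permits pairwise interactions between hinge functions
mars_fit <- earth(y ~ ., data = df, degree = 2)

summary(mars_fit)  # interaction terms appear as products of hinges, e.g. h(x1-a) * h(x2-b)
evimp(mars_fit)    # importance of the variables used in the selected terms
```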

In a decision tree, there is the idea that two factors that are split in succession might indicate an interaction. I've seen papers on this but can't find them at the moment (I need to spend some time tracking them down).

There is also the idea that it is unlikely (but not impossible) for an interaction to matter when either of its main effects is unimportant. You might try building models without the interactions, then selectively add them at the end.
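A sketch of that "main effects first" heuristic using plain lm(); all names here are invented for illustration:

```r
set.seed(1)
n  <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
df$y <- df$x1 + df$x2 + df$x1 * df$x2 + rnorm(n)

# step 1: which main effects look important?
main_fit <- lm(y ~ ., data = df)
p_vals   <- summary(main_fit)$coefficients[-1, "Pr(>|t|)"]
keep     <- names(p_vals)[p_vals < 0.05]

# step 2: selectively test interactions among the surviving main effects
if (length(keep) >= 2) {
  for (term in combn(keep, 2, paste, collapse = ":")) {
    f   <- reformulate(c(setdiff(names(df), "y"), term), response = "y")
    fit <- lm(f, data = df)
    cat(term, "p =", summary(fit)$coefficients[term, "Pr(>|t|)"], "\n")
  }
}
```

This keeps the number of interactions tested down to pairs of the surviving main effects, rather than every possible pair.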

You could write a selection algorithm that leaves two variables out (along with their interaction) at a time to look for what seems to help.

We're writing a whole chapter on this for the next book (but are still about two chapters away from getting our ideas together).

Thanks very much for the detailed feedback. I will go away and research all of these areas. I can post any interesting papers I find, if you are interested.

Previously I mentioned I would see if there was some way to automatically select valid interactions.
I should mention I'm not a data scientist, so my understanding of some of this might be somewhat flawed.
I basically used the suggestions from @max as a starting point.

It seems decision trees can be used as a means of finding interactions. This document shows the author first using decision trees to find interactions and then testing the significance of those interactions with a linear regression model. I assume this method only works if the interactions the tree finds are linear.

1 outlines a few methods, one of which, association rules (Apriori), has been used to find interactions between sets of drugs.

The approach I found in 2 first uses bagging to find important features and then a technique called Additive Groves to find interactions; the data used is ecology data. The researchers provide all their code on GitHub, and there is even a package for it, which looks promising.

3 suggests that while predictors taken in isolation may not provide good predictive power, these weak predictors could prove useful when combined. The paper goes on to describe in great detail an algorithm called INTERACT which searches for such predictors. The maths in this one is slightly beyond me, but I added it in case anyone has the urge to explain it in layman's terms.

4 is a practical application of the hierarchical group-lasso regularization paper by M. Lim and T. Hastie, which has a package here. The author uses it in the context of finding interactions in his data set for energy conservation. It is an interesting read, as it is grounded in reality and really helped me understand the original paper 5. To echo one of @Max's points above, the paper defines interactions through hierarchical trees: in finding the optimal first split, the boosting algorithm is looking for the best main effect, and the subsequent split is then made conditioned on the first. The original paper also points to other research in this area, including papers 6, 7 and 8. As far as I can tell it is not in caret.
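A hedged sketch of the glinternet package (the Lim & Hastie implementation); I am assuming the interface from its documentation, so check ?glinternet before relying on this:

```r
library(glinternet)

set.seed(1)
n <- 200
X <- matrix(rnorm(n * 4), n, 4)
y <- X[, 1] * X[, 2] + rnorm(n)

# numLevels: 1 for each continuous predictor (number of levels for factors)
cv_fit <- glinternet.cv(X, y, numLevels = rep(1, ncol(X)))

# the coefficients at the CV-chosen lambda include the selected
# main effects and pairwise interactions
str(coef(cv_fit))
```

The hierarchy constraint means an interaction can only enter the model along with its main effects, which matches the heuristic @Max mentioned above.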

9 used several tree-type algorithms to determine interactions among variables in an ecological data set. They used partial dependence plots to examine these interactions and how they related to the dependent variable. From what I can tell, all the algorithms used are in R, and the authors have a GitHub page with all their code.

Finally, 10 shows how one very clever researcher used xgboost to find important interactions in a model. As far as I can tell (I'm open to correction), xgboost is a variation on gradient boosted machines with a faster implementation. The interaction component was, it seems, originally created for a Kaggle contest for Bosch, where a number of metrics allow a user to rank their interactions in terms of importance. The executable for XGBoost Feature Interactions & Importance (RXGBfi) can be found online. My only concern is that I wouldn't be sure what counts as a good score for any discovered interaction; if anyone can point me somewhere, that would be much appreciated. Interestingly enough, the gbm package in R also has a feature-interaction function called interact.gbm, which calculates an H statistic between 0 and 1 to give you an idea of whether your interaction is useful. I guess you could use xgboost first to find the feature interactions and then calculate the H statistic afterwards. If anyone has seen this done in the wild, I would be forever grateful :).
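For reference, a sketch of the interact.gbm H statistic, assuming the gbm package is installed; the simulated data are illustrative only:

```r
library(gbm)

set.seed(1)
n  <- 500
df <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
df$y <- df$x1 * df$x2 + rnorm(n, sd = 0.1)

# interaction.depth = 2 lets trees model pairwise interactions
gbm_fit <- gbm(y ~ ., data = df, distribution = "gaussian",
               n.trees = 500, interaction.depth = 2)

# Friedman's H statistic: near 0 = little interaction, near 1 = strong
interact.gbm(gbm_fit, data = df, i.var = c("x1", "x2"), n.trees = 500)
interact.gbm(gbm_fit, data = df, i.var = c("x1", "x3"), n.trees = 500)
```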
There is an R port for RXGBfi which creates an Excel sheet where you can specify how deep you want to go with the interactions. It also incorporates a Shiny app so you can see the strength of your interactions, but I was unable to get the Shiny portion to run. The code is also available on GitHub, so it is easy to pull the function without the Shiny app. I suspect I have something not set up correctly on my system. Just a quick word of warning: it requires that an executable be installed on your C: drive. I had no problems with it on a Windows machine. If, however, you don't want to install anything on your machine, this blog gives a quick overview with screenshots.

Anyway, that's what I could find, and I hope you find it useful. There are some incredibly smart people out there, and this seems to be an area of active research. Apologies for the delay in coming back to this.

@john.smith - that's an awesome set of articles, thanks for your work on this one!

I can add one more that I came across today. Here you will find the theory, and the implementation is already done in the iml package, which generally seems to provide many useful ways to make models more interpretable.
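A sketch of iml's model-agnostic Interaction measure (it computes Friedman's H statistic for any fitted model), assuming the iml and randomForest packages are installed:

```r
library(iml)
library(randomForest)

set.seed(1)
n  <- 300
df <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
df$y <- df$x1 * df$x2 + rnorm(n, sd = 0.1)

rf   <- randomForest(y ~ ., data = df)
pred <- Predictor$new(rf, data = df[, c("x1", "x2", "x3")], y = df$y)

ia <- Interaction$new(pred)   # overall interaction strength per feature
ia$results

# two-way strengths involving a single feature:
ia2 <- Interaction$new(pred, feature = "x1")
ia2$results
```

Because it only needs a prediction function, the same code works for gbm, xgboost, or anything else iml's Predictor can wrap.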

I realise we are moving slightly off topic, but below are some very interesting presentations from the Budapest R group (1, 2) on a package called DALEX. They must be Doctor Who fans. The package allows you to interpret machine learning models.

I must play around with it to see if it can be used to interpret interactions found in models.

Kriging, Cubist, and neural networks can detect interactions. Cubist in particular gets 0 RMSE on
IF col3 > 0 THEN targetcolumn =  col2
ELSE             targetcolumn = -col2

That is an example of an interaction, right?
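Reproducing that toy rule with the Cubist package (assumed installed): the target is col2 when col3 > 0 and -col2 otherwise, which Cubist's rule-based linear models can represent exactly.

```r
library(Cubist)

set.seed(1)
n  <- 500
df <- data.frame(col1 = rnorm(n), col2 = rnorm(n), col3 = rnorm(n))
df$target <- ifelse(df$col3 > 0, df$col2, -df$col2)

fit  <- cubist(x = df[, c("col1", "col2", "col3")], y = df$target)
pred <- predict(fit, df)

sqrt(mean((pred - df$target)^2))  # RMSE should be at or near 0
summary(fit)                      # the printed rules split on col3
```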

You could try generating every interaction of interest; quite a few models can deal with thousands of columns of noise. Of course, all of my information comes from toy artificial data and may not stand up to real problems.
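For what it's worth, generating every pairwise interaction up front is one line of base R; a regularized model can then sort through the resulting columns:

```r
set.seed(1)
n  <- 100
df <- as.data.frame(matrix(rnorm(n * 5), n, 5))
names(df) <- paste0("x", 1:5)

# ~ .^2 expands 5 main effects plus choose(5, 2) = 10 pairwise interactions
X <- model.matrix(~ .^2, data = df)[, -1]   # drop the intercept column

ncol(X)       # 15 columns
colnames(X)   # interaction columns are named like "x1:x2"
```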