Selecting a suitable supervised learning algorithm

shameenkhan · July 15, 2018, 2:11pm

I want to predict e-customer purchase behavior, i.e. whether an e-customer will buy an item or not. The training dataset I am using for this purpose contains 40,596,053 records.
I need help, ideally in table form, comparing supervised learning algorithms as a review activity: which algorithm would solve my problem effectively? Any help will be really appreciated.

Leon · July 15, 2018, 3:42pm

That is a very broad question, and the answer unfortunately is not simple; it depends very much on your particular case, aim and priorities.

So, a two-class classifier. I would recommend creating a base learner using logistic regression, then upping the complexity with a random forest, and finally trying a neural network. At each step, you should record the performance and then weigh model performance against model interpretability. Be careful with over-fitting, consider cross-validation, and think about how you record your performance and which measure you use.

For consistent model comparisons, I would recommend looking into the mlr package; you can take a look at the official tutorial.

I hope this gets you started, and good luck.
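
As an illustration of the baseline step described above, here is a minimal sketch of a cross-validated logistic regression in base R. It is not code from the thread: the data is simulated, and the column names `x1`, `x2` and `bought` are hypothetical stand-ins for the real predictors and purchase label. Accuracy is used only as an example measure; another measure may suit the problem better.

```r
# Minimal sketch: 5-fold cross-validated logistic regression baseline (base R only).
# The data is simulated; x1, x2 and bought are hypothetical column names.
set.seed(42)
n <- 2000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$bought <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x1 - 0.8 * toy$x2))

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold assignment

cv_accuracy <- sapply(seq_len(k), function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  fit   <- glm(bought ~ x1 + x2, data = train, family = binomial())
  prob  <- predict(fit, newdata = test, type = "response")
  mean((prob > 0.5) == (test$bought == 1))       # accuracy on the held-out fold
})

mean_cv_accuracy <- mean(cv_accuracy)
print(cv_accuracy)
print(mean_cv_accuracy)
```

Recording the per-fold values as well as the mean makes it easier to compare this baseline with a more complex model later on the same folds.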

shameenkhan · July 15, 2018, 5:08pm

Thank you so much, but my dataset is very large, and I think logistic regression and random forest are not good choices because these algorithms do not work well with large datasets. Logistic regression works well with a small dataset, and random forest takes too much time and memory to generate a model, and I have only 8 GB of RAM installed on my system.

Leon · July 15, 2018, 5:19pm

No problem - then split your data set into, e.g., 10 chunks, use each chunk separately to build each of the aforementioned models and, finally, combine the 10 models to create an ensemble model. You can also run feature importance algorithms and exclude non-informative variables to reduce the size of your data set, and/or use dimensionality reduction techniques.

Finally, if you henceforth are going to work on really big data sets - Get. more. ram.
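
The chunk-and-combine idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data is simulated, the column names are hypothetical, logistic regression stands in for any of the models mentioned, and averaging predicted probabilities is just one simple way to combine the models. In the toy example everything is in memory at once; with the real data, each chunk would presumably be read from disk separately.

```r
# Minimal sketch: fit one model per chunk, then average the predictions.
# Simulated data; x1, x2 and bought are hypothetical column names.
set.seed(1)
n <- 5000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$bought <- rbinom(n, 1, plogis(0.3 + 1.0 * toy$x1 - 0.7 * toy$x2))

# Hold out a test set, then assign the remaining rows to 10 random chunks
test_idx <- sample(n, 1000)
test     <- toy[test_idx, ]
train    <- toy[-test_idx, ]
chunk_id <- sample(rep(1:10, length.out = nrow(train)))

# Fit one logistic regression per chunk
models <- lapply(1:10, function(i) {
  glm(bought ~ x1 + x2, data = train[chunk_id == i, ], family = binomial())
})

# Ensemble: average the predicted probabilities of the 10 models
probs <- sapply(models, predict, newdata = test, type = "response")
ensemble_prob     <- rowMeans(probs)
ensemble_accuracy <- mean((ensemble_prob > 0.5) == (test$bought == 1))
print(ensemble_accuracy)
```

The held-out test set is kept apart from all chunks so that the ensemble is evaluated on rows none of the individual models has seen.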

shameenkhan · July 15, 2018, 5:24pm

oh thank you soooooo much sir

JoseCastro · July 15, 2018, 7:32pm

Hi. How many variables or dimensions are in your data set?

shameenkhan · July 16, 2018, 5:06am

dim(mydata)
[1] 40596053 8
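
For a rough sense of what these dimensions mean against 8 GB of RAM, here is a back-of-the-envelope estimate. It assumes every column is stored as a double (8 bytes per value), which the thread does not confirm; integer columns would take about half of that, and character columns could take more. Modelling functions typically make additional copies, so the working footprint would be larger than this figure.

```r
# Rough in-memory size of the data, ASSUMING all 8 columns are doubles (8 bytes each).
n_rows <- 40596053
n_cols <- 8
bytes_per_value <- 8

total_bytes <- n_rows * n_cols * bytes_per_value
total_gib   <- total_bytes / 1024^3

print(round(total_gib, 2))  # about 2.42 GiB under the all-doubles assumption
```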

shameenkhan · July 16, 2018, 5:14am

The thesis I am following did the same work with random forest and logistic regression, and its future-work section suggests SVM and NN. I tried NN but got errors, so please suggest any other algorithm, sir.

Leon · July 16, 2018, 6:01am

In order to get further help, you will need to supply a reproducible example.