I want to predict e-customer purchase behavior, i.e. whether an e-customer will buy an item or not. The training dataset I am using for this purpose contains 40,596,053 records.
I need help, ideally in table form, comparing supervised learning algorithms as a review activity: which algorithm would solve my problem most effectively? Any help will be really appreciated.
That is a very wide question and the answer unfortunately is not simple, but very much dependent on your particular case, aim and priorities.
So, a two-class classifier. I would recommend creating a base learner using logistic regression, then upping the complexity with a random forest, and finally trying a neural network. At each step, record the performance, and then weigh model performance against model interpretability. Be careful with over-fitting, consider cross-validation, and think about how you record your performance and which measure you use.
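To make the first step concrete, here is a minimal sketch of a cross-validated logistic regression baseline in base R. The data are synthetic and the column names (`x1` to `x7`, `bought`) are hypothetical stand-ins for the real 8-column data set. AUC is used here as one possible performance measure; it is an assumption on my part, not the only sensible choice.

```r
# Synthetic stand-in for the real data: 7 numeric predictors plus a binary
# target. All names here are hypothetical.
set.seed(42)
n <- 5000
X <- as.data.frame(matrix(rnorm(n * 7), ncol = 7))
names(X) <- paste0("x", 1:7)
X$bought <- rbinom(n, 1, plogis(0.8 * X$x1 - 0.5 * X$x2))

# AUC via the rank-sum (Mann-Whitney) formulation.
# Requires both classes to be present in `label`.
auc <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# 5-fold cross-validation: fit on 4 folds, score the held-out fold.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
cv_auc <- sapply(seq_len(k), function(i) {
  fit <- glm(bought ~ ., data = X[fold != i, ], family = binomial())
  p <- predict(fit, newdata = X[fold == i, ], type = "response")
  auc(p, X$bought[fold == i])
})
print(round(mean(cv_auc), 3))
```

Keeping the fold assignment fixed and reusing the same `auc` function for the later, more complex models makes the comparison between them fair.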
For consistent model comparisons, I would recommend looking into the mlr package; the official tutorial is a good place to start.
I hope this gets you started, and good luck.
Thank you so much, but my dataset is very large, and logistic regression and random forest are not good choices because these algorithms do not work well with large datasets. Logistic regression works well with a small dataset, and random forest takes too much time and memory to generate a model. I have only 8 GB of RAM installed on my system.
No problem. In that case, split your data set into, e.g., 10 chunks, use each chunk separately to build each of the aforementioned models, and finally combine the 10 models into an ensemble model. You can also run feature importance algorithms and exclude non-informative variables to reduce the size of your data set, and/or use dimensionality reduction techniques.
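A rough sketch of the chunk-and-ensemble idea in base R, using logistic regression as the per-chunk model. The data are synthetic and the column names are hypothetical. Averaging the predicted probabilities is just one simple way to combine the models, chosen here for illustration.

```r
# Synthetic stand-in data; the column names are hypothetical.
set.seed(1)
n <- 10000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$bought <- rbinom(n, 1, plogis(1.0 * d$x1 - 0.7 * d$x2))

# Hold out a test set, then split the rest into 10 roughly equal chunks.
test_idx <- sample(n, 2000)
train <- d[-test_idx, ]
test <- d[test_idx, ]
chunk_id <- sample(rep(1:10, length.out = nrow(train)))

# Fit one logistic regression per chunk.
models <- lapply(1:10, function(i) {
  glm(bought ~ x1 + x2, data = train[chunk_id == i, ], family = binomial())
})

# Ensemble prediction: average the predicted probabilities across models.
probs <- sapply(models, predict, newdata = test, type = "response")
ensemble_prob <- rowMeans(probs)
ensemble_class <- as.integer(ensemble_prob > 0.5)
accuracy <- mean(ensemble_class == test$bought)
print(accuracy)
```

With the real data, each chunk would be read from disk in turn rather than sliced from an in-memory data frame, so that only one chunk occupies RAM during fitting.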
Finally, if you henceforth are going to work on really big data sets - Get. more. ram.
oh thank you soooooo much sir
Hi. How many variables or dimensions are in your data set?
dim(mydata)
[1] 40596053 8
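For a sense of scale against 8 GB of RAM, a back-of-the-envelope estimate of the raw in-memory size of these dimensions is sketched below. It assumes every column is stored either as a double (8 bytes per value) or as an integer (4 bytes per value); the actual column types are not stated here, and character or factor columns would behave differently.

```r
rows <- 40596053
cols <- 8

# Raw storage only: no object overhead and no copies made during fitting.
bytes_double <- rows * cols * 8   # numeric (double) columns: 8 bytes each
bytes_int <- rows * cols * 4      # integer columns: 4 bytes each

gib_double <- bytes_double / 1024^3
gib_int <- bytes_int / 1024^3
print(round(c(double = gib_double, integer = gib_int), 2))
```

So the raw table is roughly 2.4 GiB as doubles, or about half that as integers. Model-fitting functions typically create additional working copies, which is likely where an 8 GB machine runs out of room.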
The thesis I am following did the same work with random forest and logistic regression, and suggests SVM and NN as future work. I tried NN but got errors, so please suggest any other algorithm, sir.