My question is: given the current situation, how can I do the data exploration with them in order to decide whether to include them in the model or not? (I will use Neural Networks.)
When the number of levels for the categorical variables is low, I can use tools like:
DataExplorer::plot_correlation()
Gower / cluster::daisy()
But my problem here is that the amount of data is huge.
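For reference, this is roughly how I apply those tools when the data fits in memory comfortably; below is only a sketch where I draw a random subsample so the Gower dissimilarity matrix stays tractable (`df` and the subsample size of 5,000 are placeholders, not my real setup):

```r
# Sketch: explore the categorical variables on a random subsample.
# `df` stands for the full data frame with factor-coded categoricals.
library(DataExplorer)
library(cluster)

set.seed(42)
sub <- df[sample(nrow(df), min(nrow(df), 5000L)), ]

# Correlation heatmap across all columns (discrete columns are dummy-encoded)
plot_correlation(sub)

# Gower dissimilarities for mixed numeric/categorical data;
# this is O(n^2) in the number of rows, hence the subsample
d <- daisy(sub, metric = "gower")
summary(d)
```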
Data Preparation is where you will spend most of your time: data cleaning, transformation, reduction, balancing, sampling, or other advanced data preparation methods.
What is your target variable/class? The algorithm to use will depend on the business problem and other factors such as performance (e.g. how long it takes to build the model with cross-validation) and how easy the model is to interpret.
As for which attributes to use, there are different attribute selection techniques you can use that will rank attributes (e.g. by Gain Ratio) or select them for you; a small sketch follows below.
Last but not least, know your data; if you have no domain knowledge, research the dataset.
I tried to keep the answer really high level without going into detail on the tools to use, because that is up to you.
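As a hedged illustration of what such a ranking can look like in R (not a prescription), here is a sketch using the FSelector package. Gain ratio expects a categorical class, so a continuous target would first need to be binned just for this filter step; `df`, the column names, the bin count, and `k` are all placeholders:

```r
# Hypothetical sketch of filter-based attribute ranking with FSelector.
library(FSelector)

df_rank <- df
df_rank$class_bin <- cut(df_rank$score, breaks = 5)  # bin a continuous target for the filter
df_rank$score <- NULL

# Gain ratio of each attribute with respect to the (binned) class
weights <- gain.ratio(class_bin ~ ., data = df_rank)
weights[order(-weights$attr_importance), , drop = FALSE]  # ranked attributes

cutoff.k(weights, k = 10)  # keep the 10 best-ranked attributes
```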
Data Preparation is already done, and yes, we spent a lot of time on that trying to select the most interesting features for our problem. The current dataset has no missing values. Reduction, feature transformation, value imputation, outlier processing, etc. were already done.
The variable to predict is score, which, as you can see above, is continuous.
This is why I posted the current thread: I am trying to figure out which attributes are the most interesting.
We are totally familiar with our data, so if at some point we need to make a decision based on that knowledge, it won't be a problem.
Getting a firm grasp on categorical features, especially when they have high cardinality, is difficult with regular exploratory data analysis approaches. Often, with a data set like this, I will do some exploratory work with a random forest model.
Random forests work very naturally with categorical features, and using various interpretation methods (e.g. variable importance, PDPs) you should be able to identify which features seem to carry a signal and which do not.
This also allows you to see the impact of different feature engineering approaches. A neural net is going to require you to convert your features into numeric values one way or another (e.g. one-hot encoding, ordinal encoding). With a random forest you can easily compare the impact these encodings have on performance and on identifying signals. For example, if there is any order to some of your categorical features, then ordinal encoding should improve your RF. If there is no order, then compare how label encoding vs. one-hot encoding impacts performance. This is important because you have high cardinality in some of your features (e.g. catVar_08), and one-hot encoding these features will explode the number of parameters your neural net has, which can have computational and performance impacts. A rough sketch of this workflow follows below.
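Here is a minimal sketch of that exploratory workflow, assuming a data frame `df` with the continuous target `score` and factor-coded categoricals; the ranger and pdp packages stand in for whatever RF implementation you prefer:

```r
# Exploratory random forest: permutation importance plus a partial dependence plot.
library(ranger)
library(pdp)

set.seed(123)
rf <- ranger(
  score ~ ., data = df,
  importance = "permutation",           # permutation-based variable importance
  respect.unordered.factors = "order",  # reasonable handling of high-cardinality factors
  num.trees = 500
)

# Which features seem to carry signal?
sort(rf$variable.importance, decreasing = TRUE)

# Out-of-bag error; refit on differently encoded copies of df and compare this number
rf$prediction.error

# Partial dependence of the prediction on one high-cardinality feature
pd <- partial(rf, pred.var = "catVar_08", train = df)
plotPartial(pd)
```

To compare encodings, you could refit the same forest on, say, a one-hot-encoded copy of the data (e.g. built with model.matrix) and see how the out-of-bag error and the importance ranking shift.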
@tlg265, never rush to pick an algorithm upfront. You need to perform what are called experiments. This allows you to test multiple algorithms and pick one based on business requirements and technology infrastructure.
Some people will rush to say: use neural networks, stacking, ensembles, random forests, etc. Do not do that, because the type of data may sometimes require a different algorithm.
I love R and I use it together with Python. However, I took this graduate certificate and learned a lot about proper data mining and algorithm selection. I learned Weka and played around with Orange.
Weka allows you to focus on really cleaning the data and finding the best algorithm without worrying about programming. After deciding on an algorithm, you can always use R or Python for automation, or just use the model from Weka.
This guy here has good tutorials I found easy to understand.
You want to develop a reproducible process for building your model. Is your dataset balanced, and how did you test for that? If your data is not sensitive, randomly sample about 10,000 observations, post a link to that dataset as a CSV, and I will play around and see which algorithm works best (a quick sampling sketch is below).
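Something along these lines would do for the sample (a sketch only; `df` and the file name are placeholders, and since the target is continuous, "balanced" here just means looking at its distribution):

```r
# Check the target's distribution and export a ~10,000-row random sample as CSV.
set.seed(2020)

hist(df$score, breaks = 50, main = "Distribution of score")

idx <- sample(nrow(df), 10000L)
write.csv(df[idx, ], "sample_10k.csv", row.names = FALSE)
```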
Never just randomly pick an algorithm; you must be able to show why you picked it.
Thanks