I'm working on a machine learning project where I have both continuous and discrete variables. The goal is to predict the target variable, score, in around 1 second or less.
The structure of the data is shown below:
> str(myds)
'data.frame': 841500 obs. of 30 variables:
$ score : num 0 0 0 0 0 0 0 0 0 0 ...
$ amount_sms_received : int 0 0 0 0 0 0 3 0 0 3 ...
$ amount_emails_received : int 3 36 3 12 0 63 9 6 6 3 ...
$ distance_from_server : int 17 17 7 7 7 14 10 7 34 10 ...
$ age : int 17 44 16 16 30 29 26 18 19 43 ...
$ points_earned : int 929 655 286 357 571 833 476 414 726 857 ...
$ registrationYYYY : Factor w/ 2 levels ...
$ registrationDateMM : Factor w/ 9 levels ...
$ registrationDateDD : Factor w/ 31 levels ...
$ registrationDateHH : Factor w/ 24 levels ...
$ registrationDateWeekDay : Factor w/ 7 levels ...
$ catVar_05 : Factor w/ 2 levels ...
$ catVar_06 : Factor w/ 140 levels ...
$ catVar_07 : Factor w/ 21 levels ...
$ catVar_08 : Factor w/ 1582 levels ...
$ catVar_09 : Factor w/ 70 levels ...
$ catVar_10 : Factor w/ 755 levels ...
$ catVar_11 : Factor w/ 23 levels ...
$ catVar_12 : Factor w/ 129 levels ...
$ catVar_13 : Factor w/ 15 levels ...
$ city : Factor w/ 22750 levels ...
$ state : Factor w/ 55 levels ...
$ zip : Factor w/ 26659 levels ...
$ catVar_17 : Factor w/ 2 levels ...
$ catVar_18 : Factor w/ 2 levels ...
$ catVar_19 : Factor w/ 3 levels ...
$ catVar_20 : Factor w/ 6 levels ...
$ catVar_21 : Factor w/ 2 levels ...
$ catVar_22 : Factor w/ 4 levels ...
$ catVar_23
Question 1: Given the requirements above, what would be the best prediction algorithm?
If I go to the following Wizard link:
https://mod.rapidminer.com/#app
And I check:
Column types: { Numerical, Categorical }
Target type: Numerical
Number of columns: 10s
Number of rows: 100'000s
Then the only enabled predictive algorithm is KNN.
Unfortunately, KNN is not an option for me because the prediction needs to be done in 1 second or less.
Then, if we transform the dataset by removing the many rarely used levels of the discrete variables, one-hot encoding the discrete variables, and target/mean encoding { city, zip }, we get around 300 numerical columns.
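To make the transformation concrete, here is a minimal sketch in base R on toy data. It is not the exact code I ran: the data frame `toy`, the helpers `collapse_rare` and `target_encode`, and the `min_count` threshold are hypothetical names and values chosen for illustration; only the column names score, state, and city come from the dataset above.

```r
# Toy data standing in for myds (hypothetical values; only the column names
# score, state, and city come from the real dataset).
toy <- data.frame(
  score = c(0, 1, 0, 2, 4, 0, 3, 1),
  state = factor(c("CA", "CA", "NY", "NY", "CA", "TX", "NY", "CA")),
  city  = factor(c("LA", "LA", "NYC", "NYC", "SF", "Austin", "NYC", "LA"))
)

# 1) Collapse levels seen fewer than min_count times into one "other" level.
collapse_rare <- function(f, min_count) {
  counts <- table(f)
  rare <- names(counts)[counts < min_count]
  f <- as.character(f)
  f[f %in% rare] <- "other"
  factor(f)
}
toy$state <- collapse_rare(toy$state, min_count = 2)

# 2) One-hot encode: one 0/1 column per remaining level
#    ("- 1" drops the intercept so every level gets its own column).
state_onehot <- model.matrix(~ state - 1, data = toy)

# 3) Target/mean encode: replace each level by the mean score observed for it.
#    In a real pipeline the means should come from the training rows only,
#    otherwise the target leaks into the feature.
target_encode <- function(f, y) {
  level_means <- tapply(y, f, mean)
  as.numeric(level_means[as.character(f)])
}
toy$city_te <- target_encode(toy$city, toy$score)

# Final all-numeric table: target, one mean-encoded column, one-hot columns.
encoded <- cbind(toy[, c("score", "city_te")], state_onehot)
```

Note that the one-hot columns produced in step 2 contain only 0 and 1, while the mean-encoded column from step 3 is a regular numeric column.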
Then, we input that again into the Wizard:
https://mod.rapidminer.com/#app
Column types: { Numerical } (changed)
Target type: Numerical
Number of columns: 100s (changed)
Number of rows: 10'000s (changed)
and now we get Neural Networks as an option. By the way, if we change the number of rows from 10'000 to 100'000 again, the Neural Networks option disappears.
For now let's proceed with: Number of rows: 10'000s
If we change from:
Column types: { Numerical }
Target type: Numerical
Number of columns: 100s
Number of rows: 10'000s
To:
Column types: { Numerical, Binary } (changed)
Target type: Numerical
Number of columns: 100s
Number of rows: 10'000s
(just adding Binary to the column types)
then the Neural Networks option disappears again.
My concern here is that when we one-hot encode the discrete variables, the resulting columns are actually binary.
Question 2: Could you give me some hints about what's going on here?
Question 3: Do you know of any table or checklist that tells me which machine learning algorithms should be discarded given the nature of a problem? I searched on Google but didn't find a really reliable answer.
The Wizard above doesn't tell me why it is discarding Neural Networks.
Thanks!