I'm doing an `Exploratory Data Analysis (EDA)`, including different `Unsupervised Analysis` techniques, in order to select the right variables for the `Supervised Analysis`, which will be done with `Neural Networks (NN)`. The variable to predict is `score`.

```
nr1 <- nrow(myds)                     # total number of rows
nr2 <- nrow(myds[myds$score != 0, ])  # rows with a non-zero score
nr1
nr2
cat(sprintf('Ratio of rows with "score" different from 0: %.4f', nr2 / nr1))
```

```
29 variables = 23 categorical + 6 continuous
```

Right now I'm focused on the `categorical` variables. (By the way, I already opened a separate topic for the `continuous` variables here.)

I split the original dataset, and right now I'm working with a subset where all variables are `categorical` except `score`, which I included here for obvious reasons.

When I run:

```
library("DataExplorer")
plot_correlation(myds)
```

I get:

where the `score` (the variable to predict) is in the first row at the bottom (and the first column on the left). I added some green points to highlight where I see a color change along the `score` row.

Here we have more info about our dataset:

```
$ score : num 0 0 0 0 0 0 0 0 0 0 ...
$ catVar_01 : Factor w/ 2 levels ...
$ registrationDateMM : Factor w/ 9 levels ...
$ registrationDateDD : Factor w/ 31 levels ...
$ registrationDateHH : Factor w/ 24 levels ...
$ registrationDateWeekDay : Factor w/ 7 levels ...
$ catVar_06 : Factor w/ 140 levels ...
$ catVar_07 : Factor w/ 21 levels ...
$ catVar_08 : Factor w/ 1582 levels ...
$ catVar_09 : Factor w/ 70 levels ...
$ catVar_10 : Factor w/ 755 levels ...
$ catVar_11 : Factor w/ 23 levels ...
$ catVar_12 : Factor w/ 129 levels ...
$ catVar_13 : Factor w/ 15 levels ...
$ city : Factor w/ 22750 levels ...
$ state : Factor w/ 55 levels ...
$ zip : Factor w/ 26659 levels ...
$ catVar_17 : Factor w/ 2 levels ...
$ catVar_18 : Factor w/ 2 levels ...
$ catVar_19 : Factor w/ 3 levels ...
$ catVar_20 : Factor w/ 6 levels ...
$ catVar_21 : Factor w/ 2 levels ...
$ catVar_22 : Factor w/ 4 levels ...
$ catVar_23 : Factor w/ 5 levels ...
```

where: `{ MM: month, DD: day of the month, HH: hour }`

When I run the `plot_correlation` command above, it shows some warnings (by default, `plot_correlation` drops factors with more than `maxcat = 20` levels):

```
> plot_correlation(dataset_cat)
11 features with more than 20 categories ignored!
registrationDateDD: 31 categories
registrationDateHH: 24 categories
catVar_06: 138 categories
catVar_08: 1571 categories
catVar_09: 65 categories
catVar_10: 732 categories
catVar_11: 23 categories
catVar_12: 129 categories
city: 22604 categories
state: 54 categories
zip: 26458 categories
```

## Thinking about this

Before categorical variables are passed to the model, each of them needs to be converted into multiple dummy variables. For example, if the variable `state` has `55` levels, it will be converted into `54` dummy variables, which is a lot. On top of that, the list above contains categorical variables with many more levels.
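This k-levels-to-(k-1)-dummies expansion can be seen directly with base R's `model.matrix()`; the toy `state` column below is illustrative, not taken from the real dataset:

```r
# Toy factor with 5 levels; model.matrix() performs the dummy expansion,
# producing an intercept plus k - 1 = 4 indicator columns
df <- data.frame(state = factor(c("CA", "NY", "TX", "FL", "WA")))
mm <- model.matrix(~ state, data = df)
ncol(mm) - 1  # 4 dummy columns for 5 levels (intercept excluded)
```

With 55 levels the same call would produce 54 indicator columns, and a factor like `zip` (26,659 levels) would be far beyond what is practical.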

## My Questions

1- Is there any way to extract valuable information from these categorical variables with several levels?

2- What do you think about this: in order to investigate a possible correlation between the categorical variables and `score`, for each categorical variable I group by it and then calculate the mean `score` per group. If there are significant differences between the group means, then most likely that categorical variable has some impact on `score`. I don't know whether this makes sense; I came up with it after feeling frustrated that `plot_correlation` didn't handle those categorical variables, and I wanted to get information from them somehow.
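The grouped-mean idea from question 2 can be sketched in base R as follows; `catVar` and the simulated `score` are stand-ins for the real columns:

```r
# Toy data: one categorical variable with 3 levels and a numeric score
set.seed(1)
toy <- data.frame(
  catVar = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),
  score  = rnorm(100)
)

# Mean score per level of the categorical variable
group_means <- aggregate(score ~ catVar, data = toy, FUN = mean)
group_means

# A one-way ANOVA gives a formal test of whether the group means differ
summary(aov(score ~ catVar, data = toy))
```

Eyeballing "significant changes between each mean" can be misleading for levels with few observations, so a formal test (or at least the per-group counts) is worth looking at alongside the means.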

3- What do you think about `Binary Encoding` for categorical variables with many values? In this article, the author says: "*With (for example) only three levels, the information embedded (with One-hot Encoding) becomes muddled. There are many collisions and the model can't glean much information from the features. Just One-hot encode a column if it only has a few values. In contrast, Binary Encoding really shines when the cardinality of the column is higher - with the 50 US states, for example. Binary Encoding creates fewer columns than One-hot Encoding. It is more memory efficient. It also reduces the chances of dimensionality problems with higher cardinality.*"

Regarding `Question 3`, the point is that before deciding to include a categorical variable in the model, I would need to know whether it is worth it. That's why I would like to detect any relation between the categorical variables and `score` during the `unsupervised analysis` phase. Of course, we could train several more `Neural Networks` with and without the `categorical variables` and compare the errors, but that's extra work I would like to avoid.

Thanks for your attention!