Converting some variables from int to category during data preprocessing for ML using R

robertm · August 23, 2019, 8:52am

I have a dataset about gaming players, shown in the images below.

The goal is to predict the value of Score.

[NUM] score
[DATE] RegistrationDate
[CAT] Gender
[NUM] Age
[CHR] City
[CHR] State
[CAT] Group
[CAT] GamingRoom
[NUM] PointsEarned
[CHR] Sponsor
[CAT] ServerNode
[NUM] DistanceFromServer
[CAT] PlayerType
[CHR] Device

Where:
NUM: numeric value
CAT: category value (or enumeration)
CHR: string value

The dataset was originally in a CSV file, which I imported by running the following command:

dataset_xyz = read.csv("R/xyz/dataset_xyz.csv")

I have three questions about this:

Question 1: How can I convert the columns { Gender, Group, GamingRoom, ServerNode, PlayerType } from numeric to something like a category? I don't want these numbers to be treated as quantities during training.

Question 2: When I use columns with string values during training, should these values be stored as Factor, as in the image above, or as chr (character)?

Question 3: When I normalize the numeric values before training (most likely with z-scores), how do I keep the scaling parameters so that I can apply the same scaling to the test data? Both datasets need to be on the same scale.

Thanks in advance!

FJCC · August 23, 2019, 1:02pm

You do not say what sort of model you plan to build, and I am only familiar with a limited number of functions. However, I would say:

Q1: Use as.factor(). If you have both single digit and multiple digit numbers in a column, you may want to put a leading zero on the single digit values to preserve the order. You can use

library(dplyr)
df <- df %>% mutate(Group = formatC(Group, width = 2, flag = "0"))

which will turn the column into characters. Then proceed with as.factor().
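As a minimal sketch of the conversion on several columns at once: the data frame `players` and its values are made up for illustration, and only the column names come from the question. It also suggests that as.factor() on a column that is still numeric already orders the levels numerically, so the zero padding mainly matters when the values are stored as character.

```r
# Hypothetical stand-in for the imported data; the values are invented.
players <- data.frame(
  Gender     = c(1L, 2L, 1L, 2L),
  Group      = c(10L, 2L, 1L, 10L),
  PlayerType = c(3L, 1L, 2L, 3L),
  Age        = c(21, 35, 28, 40)
)

# Columns that hold category codes rather than quantities.
cat_cols <- c("Gender", "Group", "PlayerType")

# Convert only the listed columns; Age stays numeric.
players[cat_cols] <- lapply(players[cat_cols], as.factor)

str(players)

# Levels built from numbers are sorted numerically: "1", "2", "10".
levels(players$Group)

# Levels built from character strings are sorted as text: "1", "10", "2".
levels(as.factor(c("10", "2", "1")))
```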

Q2: I suggest making them factors, but I expect the modeling function will do that for you if you leave them as character.
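A small illustration, assuming a formula-based function such as lm() (the question does not name a model, so that choice and the data are invented): the character column is expanded into dummy variables the same way a factor would be. Note also that read.csv() converted strings to factors by default before R 4.0.0 and leaves them as character from 4.0.0 on, unless stringsAsFactors is set explicitly.

```r
# Invented data with a character predictor.
toy <- data.frame(
  Score  = c(10, 12, 9, 15, 11, 14),
  Device = c("PC", "Console", "PC", "Mobile", "Console", "Mobile"),
  stringsAsFactors = FALSE
)

# Fit with Device left as character.
fit_chr <- lm(Score ~ Device, data = toy)

# Fit with Device converted to a factor first.
toy_fct <- toy
toy_fct$Device <- as.factor(toy_fct$Device)
fit_fct <- lm(Score ~ Device, data = toy_fct)

# Compare the coefficients of the two fits.
all.equal(coef(fit_chr), coef(fit_fct))
```

Other modeling functions may behave differently, so checking the documentation of the one you pick is safer than relying on this.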

Q3: If you are going to split one data set into test and training subsets, I would normalize before splitting. If you want to store the values, you can save them in a file with the save() command.

save(trainMean, trainSD, file = "zNumbers.Rdata")
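A sketch of the full round trip, with invented numbers and a temporary directory standing in for wherever you would really keep the file:

```r
# Invented training values, only to have something to summarize.
train_scores <- c(12, 15, 9, 20)
trainMean <- mean(train_scores)
trainSD   <- sd(train_scores)

# Write both objects to one file (the location is an assumption).
stats_file <- file.path(tempdir(), "zNumbers.Rdata")
save(trainMean, trainSD, file = stats_file)

# Later, possibly in a new session: load() restores the objects
# under the names they had when they were saved.
rm(trainMean, trainSD)
load(stats_file)
trainMean
trainSD
```

saveRDS() and readRDS() are an alternative if you would rather assign the restored object to a name of your choosing.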

robertm · August 23, 2019, 6:23pm

@FJCC, thanks for your response, very helpful.

About Q1:

Question 1: At what point do I have to use as.factor()?

About Q3: I was also thinking about future new values (production).

Version 1:

[1] { dataset_training, dataset_testing } = split_dataset(dataset)
# training time
[2] { dataset_training_normalized, norm_config } = normalize(dataset_training)
[3] trained_model = train(dataset_training_normalized)
# testing time
[4] dataset_testing_normalized = normalize(dataset_testing, norm_config)
[5] test_result = test(trained_model, dataset_testing_normalized)
# production time
[6] input_normalized = normalize(input, norm_config)
[7] prediction = predict(trained_model, input_normalized)

I know [4] could be done in a previous step as follows:

Version 2:

[1] { dataset_normalized, norm_config } = normalize(dataset)
[2] { dataset_training_normalized, dataset_testing_normalized } = split_dataset(dataset_normalized)
# training time
[3] trained_model = train(dataset_training_normalized)
# testing time
[4] test_result = test(trained_model, dataset_testing_normalized)
# production time
[5] input_normalized = normalize(input, norm_config)
[6] prediction = predict(trained_model, input_normalized)

But with Version 1, the testing time is more similar to the production time because it also contains the normalization step. Then I can test both things at the same time: normalization and model.

Question 2: Does this make sense? If not, please let me know what you think.

Question 3: How do we get the values for { trainMean, trainSD }? Which function do I need to call, and how?

Question 4: By the way, could you suggest the model(s) that best fit the problem I described above? If you want, you can enlarge the embedded image.

Thanks!

FJCC · August 23, 2019, 8:38pm

New Question 1: I would use as.factor() immediately after importing the data, but it really does not matter when you do it. Just remember not to normalize factors.

New Question 2: Your approach seems fine to me.
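For reference, a rough R sketch of the Version 1 flow. Everything in it is an assumption made for illustration: the data is simulated, lm() stands in for whatever model you end up choosing, and fit_norm() and apply_norm() are hypothetical helpers that play the role of normalize() in your pseudocode.

```r
set.seed(42)

# Simulated data: one numeric predictor and a numeric target.
dataset <- data.frame(PointsEarned = runif(100, 0, 1000))
dataset$Score <- 5 + 0.01 * dataset$PointsEarned + rnorm(100)

num_cols <- "PointsEarned"  # predictors to z-score

# Hypothetical helper: learn the mean and sd of the chosen columns.
fit_norm <- function(data, cols) {
  list(cols = cols,
       mean = sapply(data[cols], mean),
       sd   = sapply(data[cols], sd))
}

# Hypothetical helper: apply previously learned mean and sd.
apply_norm <- function(data, cfg) {
  for (col in cfg$cols) {
    data[[col]] <- (data[[col]] - cfg$mean[[col]]) / cfg$sd[[col]]
  }
  data
}

# [1] split
train_idx        <- sample(nrow(dataset), 80)
dataset_training <- dataset[train_idx, ]
dataset_testing  <- dataset[-train_idx, ]

# [2]-[3] training time
norm_config   <- fit_norm(dataset_training, num_cols)
trained_model <- lm(Score ~ PointsEarned,
                    data = apply_norm(dataset_training, norm_config))

# [4]-[5] testing time: reuse norm_config, do not refit it
test_pred <- predict(trained_model, apply_norm(dataset_testing, norm_config))
test_rmse <- sqrt(mean((dataset_testing$Score - test_pred)^2))

# [6]-[7] production time: same norm_config again
input      <- data.frame(PointsEarned = 640)
prediction <- predict(trained_model, apply_norm(input, norm_config))
```

Saving norm_config together with trained_model would let the production step reuse exactly the scaling that was learned at training time.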

New Question 3: Here is an example of getting the mean and standard deviation of four columns in a data frame. You can then save() the Stats data frame and bring it back into the environment with load().

library(dplyr)
df <- data.frame(A = 1:3, B = 2:4, C = 3:5, D = 4:6)
df
#>   A B C D
#> 1 1 2 3 4
#> 2 2 3 4 5
#> 3 3 4 5 6
Stats <- df %>% summarize(MeanA = mean(A), SDA = sd(A),
                          MeanB = mean(B), SDB = sd(B),
                          MeanC = mean(C), SDC = sd(C),
                          MeanD = mean(D), SDD = sd(D))
Stats
#>   MeanA SDA MeanB SDB MeanC SDC MeanD SDD
#> 1     2   1     3   1     4   1     5   1

Created on 2019-08-23 by the reprex package (v0.2.1)
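If there are many numeric columns, a base R variant may be less typing. This is only a sketch: `new_rows` is an invented row of future data, and the stored means and standard deviations are passed to scale() through its center and scale arguments.

```r
df <- data.frame(A = 1:3, B = 2:4, C = 3:5, D = 4:6)

# One named vector of means and one of standard deviations.
col_mean <- sapply(df, mean)
col_sd   <- sapply(df, sd)

# Invented new observation with the same columns, in the same order.
new_rows <- data.frame(A = 4, B = 5, C = 6, D = 7)

# z-score the new data with the statistics computed from df.
scaled_new <- scale(new_rows, center = col_mean, scale = col_sd)
scaled_new
```

Calling scale(df) without those arguments computes the statistics itself and keeps them in the "scaled:center" and "scaled:scale" attributes of the result, which is another place to pick them up from.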

New Question 4: I cannot make a reasonable suggestion about modeling the data without knowing a lot more about it.

robertm · August 26, 2019, 11:44pm

Thank you, @FJCC, that worked!
