Converting some variables from int to category during data preprocessing for ML using R

I have a dataset about gaming players that you can see on the images below.

The goal here is to predict the value of: Score.

[NUM] score
[DATE] RegistrationDate
[CAT] Gender
[NUM] Age
[CHR] City
[CHR] State
[CAT] Group
[CAT] GamingRoom
[NUM] PointsEarned
[CHR] Sponsor
[CAT] ServerNode
[NUM] DistanceFromServer
[CAT] PlayerType
[CHR] Device

NUM: numeric value
CAT: category value (or enumeration)
CHR: string value

The dataset was originally on a .CSV file, imported by running the following command:

dataset_xyz = read.csv("R/xyz/dataset_xyz.csv")

I have three questions:

Question 1: How can I convert the columns { Gender, Group, GamingRoom, ServerNode, PlayerType } from numeric to something like a category? I don't want these numbers to be handled incorrectly during training.

Question 2: When I use columns with string values during training, should these values be Factor, as in the image above, or chr (character)?

Question 3: Before training, when I normalize the numeric values (most likely with z-scores), how do I keep the scaling parameters so that I can apply the same scaling to the test data? Both datasets need to be normalized on the same scale.

Thanks in advance!

You do not say what sort of model you plan to build and I am only familiar with a limited number of functions. However, I would say

Q1: Use as.factor(). If you have both single digit and multiple digit numbers in a column, you may want to put a leading zero on the single digit values to preserve the order. You can use

library(dplyr)  # provides %>% and mutate()
df <- df %>% mutate(Group = formatC(Group, width = 2, flag = "0"))

which will turn the column into characters. Then proceed with as.factor().
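As a minimal sketch of Q1, assuming the column names from the question and invented values (the `players` data frame is a hypothetical stand-in for dataset_xyz), several coded columns can be converted in one step with base R:

```r
# Hypothetical stand-in for dataset_xyz; all values are invented for illustration.
players <- data.frame(
  Score      = c(10.5, 20.1, 15.3, 8.7),
  Gender     = c(1, 2, 1, 2),
  Group      = c(10, 2, 1, 2),
  GamingRoom = c(3, 3, 1, 2),
  ServerNode = c(1, 1, 2, 2),
  PlayerType = c(0, 1, 1, 0)
)

# Columns that hold category codes rather than quantities.
cat_cols <- c("Gender", "Group", "GamingRoom", "ServerNode", "PlayerType")

# lapply() applies as.factor() to each selected column; assigning the result
# back into players[cat_cols] replaces the numeric columns with factors.
players[cat_cols] <- lapply(players[cat_cols], as.factor)

str(players)            # Score stays numeric, the coded columns become factors
levels(players$Group)   # levels follow the numeric order of the original values
```

When as.factor() is applied directly to a numeric column, the levels are built from the sorted numeric values, so zero padding is mainly relevant if the column was already read in as character.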

Q2. I suggest making them factors but I expect the modeling function will do that for you if you leave them as character.
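To illustrate Q2, here is a small sketch using lm() as an example modeling function and invented data (the `toy` data frame and its values are assumptions). It fits the same model with a character column and with the equivalent factor column:

```r
# Invented toy data: the same grouping stored as character and as factor.
toy <- data.frame(
  Score  = c(5, 7, 6, 9, 8, 10),
  Device = c("pc", "console", "mobile", "pc", "console", "mobile"),
  stringsAsFactors = FALSE
)
toy$DeviceFactor <- as.factor(toy$Device)

# lm() converts character predictors to factors internally,
# so both fits estimate the same coefficients.
fit_chr <- lm(Score ~ Device, data = toy)
fit_fct <- lm(Score ~ DeviceFactor, data = toy)

unname(coef(fit_chr))
unname(coef(fit_fct))
```

This holds for lm(); other modeling functions may require factors or a numeric matrix, so it is worth checking the documentation of whichever one is used. Note also that since R 4.0.0, read.csv() leaves strings as character by default (stringsAsFactors = FALSE), whereas earlier versions converted them to factors.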

Q3. If you are going to split one data set into test and training subsets, I would normalize before splitting. If you want to store the values, you can save them in a file with the save() command.

save(trainMean, trainSD, file = "zNumbers.Rdata")
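A minimal sketch of the full round trip, assuming invented values for a single numeric column and a temporary file location (a real project would use a fixed path):

```r
# Invented training values for one numeric column.
train_age <- c(21, 35, 28, 40, 26)

trainMean <- mean(train_age)
trainSD   <- sd(train_age)

# Write both objects to a file in the session's temporary directory.
stats_file <- file.path(tempdir(), "zNumbers.Rdata")
save(trainMean, trainSD, file = stats_file)

# Simulate a fresh session: remove the objects, then restore them by name.
rm(trainMean, trainSD)
load(stats_file)

# Apply the stored parameters to new data.
new_age <- c(30, 50)
new_age_z <- (new_age - trainMean) / trainSD
new_age_z
```

load() restores the objects under the names they had when saved, so trainMean and trainSD are available again without any assignment.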

@FJCC, thanks for your response, very helpful.

About Q1:

Question 1: At what point do I have to use as.factor()?

About Q3: I was also thinking about future new values (production).

Version 1:

[1] { dataset_training, dataset_testing } = split_dataset(dataset)
# training time
[2] { dataset_training_normalized, norm_config } = normalize(dataset_training)
[3] trained_model = train(dataset_training_normalized)
# testing time
[4] dataset_testing_normalized = normalize(dataset_testing, norm_config)
[5] test_result = test(trained_model, dataset_testing_normalized)
# production time
[6] input_normalized = normalize(input,  norm_config)
[7] prediction = predict(trained_model, input_normalized)

I know [4] could be done in a previous step as follows:

Version 2:

[1] { dataset_normalized, norm_config } = normalize(dataset)
[2] { dataset_training_normalized, dataset_testing_normalized } = split_dataset(dataset_normalized)
# training time
[3] trained_model = train(dataset_training_normalized)
# testing time
[4] test_result = test(trained_model, dataset_testing_normalized)
# production time
[5] input_normalized = normalize(input,  norm_config)
[6] prediction = predict(trained_model, input_normalized)

But with Version 1, testing is more similar to production because it also includes the normalization step. That way I can test both things at the same time: the normalization and the model.
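Version 1 could be sketched in R roughly as follows. The helper names fit_normalizer() and apply_normalizer() are hypothetical (they are not from any package), the data is randomly generated, and lm() is used only as a placeholder model:

```r
# Hypothetical helper: learn the mean and sd of the chosen columns.
fit_normalizer <- function(df, cols) {
  list(cols = cols,
       mean = sapply(df[cols], mean),
       sd   = sapply(df[cols], sd))
}

# Hypothetical helper: apply previously learned means and sds (z-scores).
apply_normalizer <- function(df, norm_config) {
  cols <- norm_config$cols
  df[cols] <- Map(function(x, m, s) (x - m) / s,
                  df[cols], norm_config$mean, norm_config$sd)
  df
}

# Invented data with two numeric predictors and a numeric target.
set.seed(42)
dataset <- data.frame(
  Age          = round(runif(100, 18, 60)),
  PointsEarned = round(runif(100, 0, 1000))
)
dataset$Score <- 2 * dataset$Age + 0.01 * dataset$PointsEarned + rnorm(100)
num_cols <- c("Age", "PointsEarned")

# [1] split into training and testing rows
train_idx        <- sample(nrow(dataset), 80)
dataset_training <- dataset[train_idx, ]
dataset_testing  <- dataset[-train_idx, ]

# [2] learn the scaling from the training data only, then apply it
norm_config <- fit_normalizer(dataset_training, num_cols)
dataset_training_normalized <- apply_normalizer(dataset_training, norm_config)

# [3] train (lm() stands in for whatever model is chosen)
trained_model <- lm(Score ~ Age + PointsEarned,
                    data = dataset_training_normalized)

# [4] normalize the test data with the training parameters
dataset_testing_normalized <- apply_normalizer(dataset_testing, norm_config)

# [5] evaluate on the test data
test_pred <- predict(trained_model, newdata = dataset_testing_normalized)
test_rmse <- sqrt(mean((dataset_testing$Score - test_pred)^2))

# [6]-[7] production: same normalization, then predict
input            <- data.frame(Age = 30, PointsEarned = 500)
input_normalized <- apply_normalizer(input, norm_config)
prediction       <- predict(trained_model, newdata = input_normalized)
```

Because norm_config is an ordinary list, it can be stored alongside the model and reused unchanged for every later input.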

Question 2: Does this make sense? If not, please let me know what you think.

Question 3: How do we get the values for { trainMean, trainSD }? What function do I need to call, and how?

Question 4: By the way, could you suggest the model(s) that best fit the problem I described above? If you want, you can enlarge the embedded image.


New Q1: I would use as.factor() immediately after importing the data but it really does not matter when you do it. Just remember not to normalize factors.

New Question 2: Your approach seems fine to me.

New Question 3: Here is an example of getting the mean and standard deviation of four columns in a data frame. You can then save() the Stats data frame and bring it back into the environment with load().

library(dplyr)  # provides %>% and summarize()
df <- data.frame(A = 1:3, B = 2:4, C = 3:5, D = 4:6)
df
#>   A B C D
#> 1 1 2 3 4
#> 2 2 3 4 5
#> 3 3 4 5 6
Stats <- df %>% summarize(MeanA = mean(A), SDA = sd(A),
                          MeanB = mean(B), SDB = sd(B),
                          MeanC = mean(C), SDC = sd(C),
                          MeanD = mean(D), SDD = sd(D))
Stats
#>   MeanA SDA MeanB SDB MeanC SDC MeanD SDD
#> 1     2   1     3   1     4   1     5   1

Created on 2019-08-23 by the reprex package (v0.2.1)
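As an alternative sketch in base R, scale() computes the z-scores and keeps the means and standard deviations as attributes of its result, which can then be reused for other data. The `train` data frame repeats the example above; the `test` data frame is invented:

```r
train <- data.frame(A = 1:3, B = 2:4, C = 3:5, D = 4:6)
test  <- data.frame(A = c(2, 4), B = c(3, 5), C = c(4, 6), D = c(5, 7))

# scale() centers each column on its mean and divides by its sd.
train_scaled <- scale(train)

# The parameters it used are stored as attributes of the returned matrix.
trainMean <- attr(train_scaled, "scaled:center")
trainSD   <- attr(train_scaled, "scaled:scale")

# Reuse the training parameters for the test data instead of recomputing them.
test_scaled <- scale(test, center = trainMean, scale = trainSD)
test_scaled
```

Note that scale() returns a matrix, so as.data.frame() is needed if a data frame is required afterwards.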

New Question 4: I cannot make a reasonable suggestion about modeling the data without knowing a lot more about it.

thank you @FJCC that worked!
