I'm looking for functions (call them magic_function_1 and magic_function_2), or something similar, to achieve what is described below.
Suppose we have this helper function:
my.normalize = function(vec) {
  if (is.numeric(vec)) {
    vec = (vec - min(vec)) / (max(vec) - min(vec))
  }
  return(vec)
}
Initial dataset:
ds_1 = data.frame(
  score = c(142, 89, 540, 38, 232, 142),
  age = c(20, 18, 76, 54, 15, 22),
  points = c(4, 50, 100, 10, 9, 35),
  group = c("A", "B", "A", "A", "C", "B"),
  favoritedrink = c("Coke", "Water", "Water", "Wine", "Tea", "Coke"),
  type = c("1", "2", "3", "1", "2", "1")
)
ds_1
##   score age points group favoritedrink type
## 1   142  20      4     A          Coke    1
## 2    89  18     50     B         Water    2
## 3   540  76    100     A         Water    3
## 4    38  54     10     A          Wine    1
## 5   232  15      9     C           Tea    2
## 6   142  22     35     B          Coke    1
What I want to simulate:
library(dplyr) # provides mutate_if
ds_2 = mutate_if(ds_1, is.numeric, my.normalize)
ds_3 = data.frame(model.matrix(~ score + age + points + group + favoritedrink + type, data = ds_2))[, -1]
ds_3
##       score        age     points groupB groupC favoritedrinkTea
## 1 0.2071713 0.08196721 0.00000000      0      0                0
## 2 0.1015936 0.04918033 0.47916667      1      0                0
## 3 1.0000000 1.00000000 1.00000000      0      0                0
## 4 0.0000000 0.63934426 0.06250000      0      0                0
## 5 0.3864542 0.00000000 0.05208333      0      1                1
## 6 0.2071713 0.11475410 0.32291667      1      0                0
##   favoritedrinkWater favoritedrinkWine type2 type3
## 1                  0                 0     0     0
## 2                  1                 0     1     0
## 3                  1                 0     0     1
## 4                  0                 1     0     0
## 5                  0                 0     1     0
## 6                  0                 0     0     0
For example, I'm looking for some magic function, magic_function_1, that achieves the following:
ds_3 = magic_function_1(ds_1)
# where the magic function would also save the following config:
# ds_3.config = [saved config to convert future values with the same parameters]
where ds_3 should be the same table/output as shown before, and ds_3.config is the configuration that made that transformation possible. This configuration could be used later to apply transformations with the same scales, parameters, etc. For example, the config could store the min/max values of the numeric variables, the possible values of the categorical variables, and so on.
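To make the idea of a saved configuration concrete, here is a minimal sketch in base R. The name fit_config and the layout of the returned list are assumptions made for illustration only; this is not an existing function, and a real solution might store more (or different) information.

```r
# Hypothetical helper: records what is needed to repeat the transformation later.
fit_config = function(df) {
  is_num = vapply(df, is.numeric, logical(1))
  num_cols = names(df)[is_num]
  cat_cols = names(df)[!is_num]
  list(
    # c(min, max) for every numeric column
    ranges = lapply(df[num_cols], range),
    # known values of every categorical column, sorted the way factor() sorts them
    levels = lapply(df[cat_cols], function(x) sort(unique(as.character(x))))
  )
}

# Toy data, unrelated to ds_1, just to show the shape of the result
toy = data.frame(score = c(142, 89, 540), group = c("A", "B", "A"))
cfg = fit_config(toy)
str(cfg)
```

Storing plain min/max pairs and level vectors keeps the config easy to inspect and to save with saveRDS.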
Then, if in the future I have the following input:
input = ds_1[5,]
rownames(input) = NULL # just resetting the row indexes
input
which was in the initial table, then we should get the following:
out_1 = magic_function_2(input, ds_3.config)
all(out_1 == ds_3[5,]) == TRUE # in other words: out_1 should equal ds_3[5,], the corresponding row after normalization
Also, when using any other input that was not necessarily included in ds_1, for example:
input = data.frame(
  score = 100,
  age = 16,
  points = 73,
  group = "C",
  favoritedrink = "Water",
  type = "2"
)
when we call:
out_2 = magic_function_2(input, ds_3.config)
then, in out_2, the numeric values should be scaled according to ds_3.config and the categorical values should be transformed accordingly (as you can see in the second table above).
On the other hand, if we pass a categorical value that was not in the original dataset ds_1, for example:
input = data.frame(
  score = 100,
  age = 16,
  points = 73,
  group = "C",
  favoritedrink = "Rum",
  type = "2"
)
when we call:
out_3 = magic_function_2(input, ds_3.config)
then we should get an error, because Rum was not in the initial dataset.
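The behaviour wanted from magic_function_2 could be sketched in base R as follows. This assumes the configuration is a plain list holding each numeric column's min/max and each categorical column's known values; the names apply_config and cfg, and the hand-written config itself, are hypothetical and only illustrate the idea.

```r
# Hypothetical config, written by hand from the values in ds_1; in practice
# it would be saved when the original data is transformed.
cfg = list(
  ranges = list(score = c(38, 540), age = c(15, 76), points = c(4, 100)),
  levels = list(
    group = c("A", "B", "C"),
    favoritedrink = c("Coke", "Tea", "Water", "Wine"),
    type = c("1", "2", "3")
  )
)

# Hypothetical stand-in for magic_function_2
apply_config = function(input, cfg) {
  out = input
  # Rescale numeric columns with the stored min/max, not the input's own
  for (col in names(cfg$ranges)) {
    r = cfg$ranges[[col]]
    out[[col]] = (input[[col]] - r[1]) / (r[2] - r[1])
  }
  for (col in names(cfg$levels)) {
    values = as.character(input[[col]])
    unknown = setdiff(values, cfg$levels[[col]])
    if (length(unknown) > 0) {
      stop("Unknown value(s) in '", col, "': ", paste(unknown, collapse = ", "))
    }
    # Fixed levels make model.matrix emit the same dummy columns
    # even when the input has a single row
    out[[col]] = factor(values, levels = cfg$levels[[col]])
  }
  form = reformulate(c(names(cfg$ranges), names(cfg$levels)))
  data.frame(model.matrix(form, data = out))[, -1, drop = FALSE]
}

known = data.frame(score = 100, age = 16, points = 73,
                   group = "C", favoritedrink = "Water", type = "2")
out_known = apply_config(known, cfg)

unseen = known
unseen$favoritedrink = "Rum"
# The unseen category raises an error; capture its message for inspection
msg = tryCatch(apply_config(unseen, cfg), error = function(e) conditionMessage(e))
```

Note that values outside the stored min/max are not rejected in this sketch; they simply scale to numbers below 0 or above 1.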