library("dplyr")
library("data.table")
library("h2o")
h2o.init(nthreads = -1)
I have the following data frame:
df = data.frame(
  animal = as.factor(c("Dog", "Cat", "Lion", "Dog", "Cat", "Dog", "Cat", "Dog", "Dog", "Cat", "Lion", "Dog")),
  rating = as.double(c(25.2, 15.8, 55.3, 29.0, 23.4, 33.0, 22.6, 31.9, 21.8, 28.5, 50.2, 27.1))
) %>% arrange(desc(rating))
print(df)
## animal rating
## 1 Lion 55.3
## 2 Lion 50.2
## 3 Dog 33.0
## 4 Dog 31.9
## 5 Dog 29.0
## 6 Cat 28.5
## 7 Dog 27.1
## 8 Dog 25.2
## 9 Cat 23.4
## 10 Cat 22.6
## 11 Dog 21.8
## 12 Cat 15.8
Then, using H2O, I can create the encoding_map:
encoding_map = h2o.target_encode_create(
  as.h2o(df),
  x = list("animal"),
  y = "rating"
)
encoding_map
## $animal
## animal numerator denominator
## 1 Cat 90.3 4
## 2 Dog 168.0 6
## 3 Lion 105.5 2
##
## [3 rows x 3 columns]
Target Encoding | Basic
To apply the most basic Target Encoding, I can do the following:
df_h2o_encoded_1 = h2o.target_encode_apply(
  data = as.h2o(df),
  x = list("animal"),
  y = "rating",
  target_encode_map = encoding_map,
  holdout_type = "None", # using None for simplicity
  blended_avg = FALSE,
  noise = 0,
  seed = 1234
)
df_encoded_1 = as.data.frame(df_h2o_encoded_1)
df_encoded_1 = cbind(data.frame(id = 1:nrow(df_encoded_1)), df_encoded_1)
df_encoded_1
## id animal rating TargetEncode_animal
## 1 1 Cat 28.5 22.575
## 2 2 Cat 23.4 22.575
## 3 3 Cat 22.6 22.575
## 4 4 Cat 15.8 22.575
## 5 5 Dog 33.0 28.000
## 6 6 Dog 31.9 28.000
## 7 7 Dog 29.0 28.000
## 8 8 Dog 27.1 28.000
## 9 9 Dog 25.2 28.000
## 10 10 Dog 21.8 28.000
## 11 11 Lion 55.3 52.750
## 12 12 Lion 50.2 52.750
We can get the same result very easily with paper and pencil: we just need to calculate the average rating for each category of animal. That’s it.
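As a quick check of that paper-and-pencil calculation, here is a minimal base R sketch that rebuilds the per-category sum, count and mean without H2O. The names manual_map and encoding_simple are my own, chosen to mirror the numerator and denominator columns of encoding_map.

```r
# Same toy data as above (row order does not matter for group means).
df = data.frame(
  animal = as.factor(c("Dog", "Cat", "Lion", "Dog", "Cat", "Dog", "Cat", "Dog", "Dog", "Cat", "Lion", "Dog")),
  rating = c(25.2, 15.8, 55.3, 29.0, 23.4, 33.0, 22.6, 31.9, 21.8, 28.5, 50.2, 27.1)
)

# Per-category sum and count, mirroring numerator/denominator in encoding_map.
# tapply() returns its results in factor-level order, which matches levels().
manual_map = data.frame(
  animal      = levels(df$animal),
  numerator   = as.numeric(tapply(df$rating, df$animal, sum)),
  denominator = as.numeric(tapply(df$rating, df$animal, length))
)

# The basic (unblended) encoding is just the per-category mean.
manual_map$encoding_simple = manual_map$numerator / manual_map$denominator
print(manual_map)
```

The encoding_simple column should agree with the TargetEncode_animal values printed above for each animal.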
Target Encoding | Blended Average
In the encoding_map above, we can see that the category Lion shows up in only a few observations (compared with the other values), so its calculated average could be unreliable.
To handle this, H2O supports the parameter blended_avg, as documented at:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/blended_avg.html
There we can read:
“The blended_avg parameter defines whether the target average should be weighted based on the count of the group. It is often the case, that some groups may have a small number of records and the target average will be unreliable. To prevent this, the blended average takes a weighted average of the group’s target value and the global target value.”
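Read literally, that description suggests an encoding of the following general shape. The notation is my own, and the quote does not say how the weight is derived from the group count, so this is only a sketch of the structure:

```latex
\[
\text{encoding\_blended}(c)
  \;=\; \lambda(n_c)\,\bar{y}_c \;+\; \bigl(1 - \lambda(n_c)\bigr)\,\bar{y},
\qquad 0 \le \lambda(n_c) \le 1
\]
```

Here \bar{y}_c is the target average of category c (numerator / denominator in encoding_map), \bar{y} is the global target average, n_c is the number of observations in the category (the denominator), and \lambda(n_c) is the unspecified weight, assumed to grow with n_c.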
Let’s use the parameter blended_avg:
df_h2o_encoded_2 = h2o.target_encode_apply(
  data = as.h2o(df),
  x = list("animal"),
  y = "rating",
  target_encode_map = encoding_map,
  holdout_type = "None", # using None for simplicity
  blended_avg = TRUE,
  noise = 0,
  seed = 1234
)
df_encoded_2 = as.data.frame(df_h2o_encoded_2)
df_encoded_2 = cbind(data.frame(id = 1:nrow(df_encoded_2)), df_encoded_2)
df_encoded_2
## id animal rating TargetEncode_animal
## 1 1 Cat 28.5 29.01621
## 2 2 Cat 23.4 29.01621
## 3 3 Cat 22.6 29.01621
## 4 4 Cat 15.8 29.01621
## 5 5 Dog 33.0 29.85839
## 6 6 Dog 31.9 29.85839
## 7 7 Dog 29.0 29.85839
## 8 8 Dog 27.1 29.85839
## 9 9 Dog 25.2 29.85839
## 10 10 Dog 21.8 29.85839
## 11 11 Lion 55.3 33.49886
## 12 12 Lion 50.2 33.49886
Comparing both Target Encoding values
df_comp = df_encoded_1
df_comp = df_comp %>% left_join(as.data.frame(encoding_map$animal), by = "animal") %>% select(-numerator)
df_comp = df_comp %>% left_join(df_encoded_2[,c("id", "TargetEncode_animal")], by = "id") %>% select(-c(id))
colnames(df_comp) = c("animal", "rating", "encoding_simple", "num_obs", "encoding_blended")
setcolorder(df_comp, c("animal", "rating", "num_obs", "encoding_simple", "encoding_blended"))
df_comp = df_comp %>% mutate(encoding_diff = encoding_blended - encoding_simple)
df_comp
## animal rating num_obs encoding_simple encoding_blended encoding_diff
## 1 Cat 28.5 4 22.575 29.01621 6.441209
## 2 Cat 23.4 4 22.575 29.01621 6.441209
## 3 Cat 22.6 4 22.575 29.01621 6.441209
## 4 Cat 15.8 4 22.575 29.01621 6.441209
## 5 Dog 33.0 6 28.000 29.85839 1.858393
## 6 Dog 31.9 6 28.000 29.85839 1.858393
## 7 Dog 29.0 6 28.000 29.85839 1.858393
## 8 Dog 27.1 6 28.000 29.85839 1.858393
## 9 Dog 25.2 6 28.000 29.85839 1.858393
## 10 Dog 21.8 6 28.000 29.85839 1.858393
## 11 Lion 55.3 2 52.750 33.49886 -19.251141
## 12 Lion 50.2 2 52.750 33.49886 -19.251141
As we can see in the table above, the categories with a higher number of observations changed their encoding values comparatively little. On the other hand, the categories with a lower number of observations show a big change in their encoding values, which move closer to the global average.
This makes sense when Target Encoding is used in a learning model, because that way the encoding for categories with a lower number of observations won’t have a high impact on the target variable.
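To make "closer to the global average" concrete, here is a small sketch that recomputes the global average from the encoding_map totals and backs out how much weight each blended value puts on the category's own average. The blended values are copied from the printed table above (so they are rounded), and the column name implied_weight is my own. This only describes the output; it is not H2O's formula.

```r
# Global target average from the encoding_map totals: sum of numerators / sum of denominators.
global_avg = (90.3 + 168.0 + 105.5) / (4 + 6 + 2)

# One row per category, values copied from the comparison table above.
comp = data.frame(
  animal           = c("Cat", "Dog", "Lion"),
  num_obs          = c(4, 6, 2),
  encoding_simple  = c(22.575, 28.000, 52.750),
  encoding_blended = c(29.01621, 29.85839, 33.49886)
)

# If blended = w * simple + (1 - w) * global, then solving for w gives:
comp$implied_weight = (comp$encoding_blended - global_avg) / (comp$encoding_simple - global_avg)

# Sort by number of observations to see how the weight varies with group size.
comp = comp[order(comp$num_obs), ]
print(global_avg)
print(comp)
```

Every implied weight falls between 0 and 1, and it increases with num_obs, which is consistent with a count-based weighted average of the category mean and the global mean.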
My Goal
Given the last comparison table, I need to calculate the value of the column encoding_blended using the values in the previous columns.
Do you know what formula I can use to achieve this?