Calculating the Blended Average similarly to H2O

tlg265 · September 28, 2019, 7:25pm

library("dplyr")
library("data.table")
library("h2o")
h2o.init(nthreads = -1)

I have the following data frame:

df = data.frame(
  animal = as.factor(c("Dog", "Cat", "Lion", "Dog", "Cat", "Dog", "Cat", "Dog", "Dog", "Cat", "Lion", "Dog")),
  rating = as.double(c(25.2, 15.8, 55.3, 29.0, 23.4, 33.0, 22.6, 31.9, 21.8, 28.5, 50.2, 27.1))
) %>% arrange(desc(rating))
print(df)

##    animal rating
## 1    Lion   55.3
## 2    Lion   50.2
## 3     Dog   33.0
## 4     Dog   31.9
## 5     Dog   29.0
## 6     Cat   28.5
## 7     Dog   27.1
## 8     Dog   25.2
## 9     Cat   23.4
## 10    Cat   22.6
## 11    Dog   21.8
## 12    Cat   15.8

Then, using H2O, I can create the encoding_map:

encoding_map = h2o.target_encode_create(
  as.h2o(df),
  x = list("animal"),
  y = "rating"
)

encoding_map

## $animal
##   animal numerator denominator
## 1    Cat      90.3           4
## 2    Dog     168.0           6
## 3   Lion     105.5           2
## 
## [3 rows x 3 columns]

`Target Encoding` | `Basic`

To apply the most basic Target Encoding, I can do the following:

df_h2o_encoded_1 = h2o.target_encode_apply(
  data = as.h2o(df),
  x = list("animal"),
  y = "rating",
  target_encode_map = encoding_map,
  holdout_type = "None", # using None for simplicity
  blended_avg = FALSE,
  noise = 0,
  seed = 1234
)

df_encoded_1 = as.data.frame(df_h2o_encoded_1)
df_encoded_1 = cbind(data.frame(id = 1:nrow(df_encoded_1)), df_encoded_1)
df_encoded_1

##    id animal rating TargetEncode_animal
## 1   1    Cat   28.5              22.575
## 2   2    Cat   23.4              22.575
## 3   3    Cat   22.6              22.575
## 4   4    Cat   15.8              22.575
## 5   5    Dog   33.0              28.000
## 6   6    Dog   31.9              28.000
## 7   7    Dog   29.0              28.000
## 8   8    Dog   27.1              28.000
## 9   9    Dog   25.2              28.000
## 10 10    Dog   21.8              28.000
## 11 11   Lion   55.3              52.750
## 12 12   Lion   50.2              52.750

We can get the same result very easily with paper and pencil: we just need to calculate the average rating for each category of animal (numerator divided by denominator in the encoding_map). That's it.
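The paper-and-pencil calculation can also be sketched in base R, without an H2O cluster. This is only a minimal check, not H2O's implementation; the names df_check and manual_map are introduced here purely for illustration.

```r
# Same data as above, rebuilt so this block runs on its own
df_check <- data.frame(
  animal = c("Dog", "Cat", "Lion", "Dog", "Cat", "Dog", "Cat", "Dog", "Dog", "Cat", "Lion", "Dog"),
  rating = c(25.2, 15.8, 55.3, 29.0, 23.4, 33.0, 22.6, 31.9, 21.8, 28.5, 50.2, 27.1)
)

# Per-category sum and count: the numerator and denominator of encoding_map
numerator   <- tapply(df_check$rating, df_check$animal, sum)
denominator <- tapply(df_check$rating, df_check$animal, length)

manual_map <- data.frame(
  animal      = names(numerator),
  numerator   = as.numeric(numerator),
  denominator = as.integer(denominator)
)

# Basic target encoding: the plain per-category mean
manual_map$TargetEncode_animal <- manual_map$numerator / manual_map$denominator
print(manual_map)
```

The resulting means (22.575 for Cat, 28.000 for Dog, 52.750 for Lion) agree with the TargetEncode_animal column returned by h2o.target_encode_apply above.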

`Target Encoding` | `Blended Average`

In the encoding_map above, we can see that the category Lion appears in only 2 observations, fewer than any other category, so its calculated average could be unreliable.

To handle this, H2O supports the blended_avg parameter, documented at:

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/blended_avg.html

There we can read:

“The blended_avg parameter defines whether the target average should be weighted based on the count of the group. It is often the case, that some groups may have a small number of records and the target average will be unreliable. To prevent this, the blended average takes a weighted average of the group’s target value and the global target value.”

Let's use the blended_avg parameter:

df_h2o_encoded_2 = h2o.target_encode_apply(
  data = as.h2o(df),
  x = list("animal"),
  y = "rating",
  target_encode_map = encoding_map,
  holdout_type = "None", # using None for simplicity
  blended_avg = TRUE,
  noise = 0,
  seed = 1234
)

df_encoded_2 = as.data.frame(df_h2o_encoded_2)
df_encoded_2 = cbind(data.frame(id = 1:nrow(df_encoded_2)), df_encoded_2)
df_encoded_2

##    id animal rating TargetEncode_animal
## 1   1    Cat   28.5            29.01621
## 2   2    Cat   23.4            29.01621
## 3   3    Cat   22.6            29.01621
## 4   4    Cat   15.8            29.01621
## 5   5    Dog   33.0            29.85839
## 6   6    Dog   31.9            29.85839
## 7   7    Dog   29.0            29.85839
## 8   8    Dog   27.1            29.85839
## 9   9    Dog   25.2            29.85839
## 10 10    Dog   21.8            29.85839
## 11 11   Lion   55.3            33.49886
## 12 12   Lion   50.2            33.49886

Comparing both `Target Encoding` values

df_comp = df_encoded_1
df_comp = df_comp %>% left_join(as.data.frame(encoding_map$animal), by = "animal") %>% select(-numerator)
df_comp = df_comp %>% left_join(df_encoded_2[,c("id", "TargetEncode_animal")], by = "id") %>% select(-c(id))
colnames(df_comp) = c("animal", "rating", "encoding_simple", "num_obs", "encoding_blended")
setcolorder(df_comp, c("animal", "rating", "num_obs", "encoding_simple", "encoding_blended"))
df_comp = df_comp %>% mutate(encoding_diff = encoding_blended - encoding_simple)
df_comp

##    animal rating num_obs encoding_simple encoding_blended encoding_diff
## 1     Cat   28.5       4          22.575         29.01621      6.441209
## 2     Cat   23.4       4          22.575         29.01621      6.441209
## 3     Cat   22.6       4          22.575         29.01621      6.441209
## 4     Cat   15.8       4          22.575         29.01621      6.441209
## 5     Dog   33.0       6          28.000         29.85839      1.858393
## 6     Dog   31.9       6          28.000         29.85839      1.858393
## 7     Dog   29.0       6          28.000         29.85839      1.858393
## 8     Dog   27.1       6          28.000         29.85839      1.858393
## 9     Dog   25.2       6          28.000         29.85839      1.858393
## 10    Dog   21.8       6          28.000         29.85839      1.858393
## 11   Lion   55.3       2          52.750         33.49886    -19.251141
## 12   Lion   50.2       2          52.750         33.49886    -19.251141

As the table above shows, the category with the most observations (Dog, 6) changes the least (about 1.86). On the other hand, the category with the fewest observations (Lion, 2) changes the most (about -19.25), moving much closer to the global average of about 30.32.

This makes sense when the Target Encoding will be used as a feature in a learning model, because the encoding for categories with few observations won't then have a high impact on the predicted target.

My Goal

Given the last comparison table, I need to calculate the encoding_blended column from the values in the preceding columns.

Do you know what formula I can use to achieve this?
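One candidate, offered as an assumption rather than as H2O's confirmed implementation: weight the group mean by a logistic function of the group count, and give the remaining weight to the global mean. In the sketch below, the function blended_encoding and the constants inflection_point = 20 and smoothing = 10 are hypothetical names and values; I have not checked them against H2O's source, only against the table above.

```r
# Hypothetical helper: logistic weighting between group mean and global mean.
# inflection_point and smoothing are assumed constants, not confirmed H2O values.
blended_encoding <- function(group_mean, n_obs, global_mean,
                             inflection_point = 20, smoothing = 10) {
  # Weight of the group mean: grows towards 1 as the group count increases
  lambda <- 1 / (1 + exp((inflection_point - n_obs) / smoothing))
  lambda * group_mean + (1 - lambda) * global_mean
}

# Values taken from the comparison table (encoding_simple and num_obs)
encoding_simple <- c(Cat = 22.575, Dog = 28.000, Lion = 52.750)
num_obs         <- c(Cat = 4, Dog = 6, Lion = 2)

# Global mean of rating, recovered from the per-category means and counts
global_mean <- sum(encoding_simple * num_obs) / sum(num_obs)

encoding_blended_check <- blended_encoding(encoding_simple, num_obs, global_mean)
print(round(encoding_blended_check, 5))
```

With these assumed constants, the sketch gives values that agree with the encoding_blended column to the printed precision (29.01621, 29.85839, 33.49886). That agreement does not prove this is what H2O does internally, so confirmation would still be welcome.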

Thanks!

system · October 19, 2019, 7:25pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.
