Hi,
I am trying to understand likelihood encodings for dealing with categorical data.
My understanding is this:
- Split your dataset: Divide your dataset into a training set and a validation/test set. This ensures that you calculate the encodings based only on the training set and avoid data leakage (a sketch of this split follows after this list).
- Calculate the likelihood encodings: For each category in a categorical variable, compute the likelihood of the target variable being in a particular state (e.g., the probability of a certain class label). This can be done by grouping the training data by the category and calculating the mean, median, or any other statistic of the target variable for each category.
- Apply the encodings to the dataset: Replace the original categorical variable with the calculated encodings, i.e., replace each category in the variable with its corresponding encoding value.
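For the first step, this is roughly how I picture the split (a minimal sketch using rsample; the 80/20 proportion and the seed are arbitrary choices of mine, and df refers to the sample data frame created further down):
library(rsample)

set.seed(123)
split     <- initial_split(df, prop = 0.8)   # 80/20 train/test split
train_set <- training(split)                 # encodings are estimated from this
test_set  <- testing(split)                  # encodings are only applied to this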
From my understanding, the following R code should work:
# Load the required libraries
library(dplyr)

# Create a sample data frame with a categorical variable and a target variable
df <- data.frame(
  feature = c("A", "B", "C", "A", "B", "C"),
  target  = c(1, 0, 1, 0, 1, 0)
)

# Define a function to perform likelihood (target) encoding
likelihood_encode <- function(data, feature, target) {
  # Resolve the feature column name as a string so the join works for any
  # column, not just one literally named "feature"
  feature_name <- rlang::as_name(rlang::enquo(feature))

  # Calculate the mean target value for each level of the feature
  means <- data %>%
    group_by({{ feature }}) %>%
    summarise(mean_target = mean({{ target }}), .groups = "drop")

  # Merge the means back into the original data frame
  data_calc <- data %>%
    left_join(means, by = feature_name)

  # Replace the levels of the feature with the mean target values
  # and drop the helper column
  data <- data_calc %>%
    mutate({{ feature }} := mean_target) %>%
    select(-mean_target)

  return(data)
}

# Apply the likelihood encoding function to the sample data frame
df_encoded <- likelihood_encode(df, feature, target)
In tidymodels I know you can apply this using something like step_lencode_mixed() from the embed package.
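This is a minimal sketch of how I imagine that recipes/embed version would look, reusing the toy data above; converting the target to a factor and the prep()/bake() calls are my own assumptions, and with such a tiny data set the underlying mixed model may well complain:
library(recipes)
library(embed)

# Treat the 0/1 target as a factor so the encoding is based on a classification model
df_fct <- df %>% mutate(target = factor(target))

rec <- recipe(target ~ feature, data = df_fct) %>%
  step_lencode_mixed(feature, outcome = vars(target))

# prep() estimates the encodings on the training data,
# bake() applies them to (new) data
rec_prepped <- prep(rec, training = df_fct)
df_encoded2 <- bake(rec_prepped, new_data = df_fct)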
I have two questions:
- Is my understanding correct?
- Is it possible to use this in conjunction with tokenization? For example, if I am trying to categorize something such as UFO landings and I have all the street addresses in the world, I would tokenize each address by word and then calculate the likelihood per token. In theory, the street name would have a higher likelihood than, say, the house number (a rough sketch of what I mean follows below). I am asking this because I know GBMs, for example, tend to have degraded performance on high-dimensional datasets.
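To make the second question concrete, here is a rough sketch of what I mean on made-up address data; the address column, the toy rows, and the use of tidytext::unnest_tokens() are purely illustrative assumptions on my part:
library(dplyr)
library(tidytext)

addresses <- data.frame(
  id      = 1:3,
  address = c("12 Elm Street", "99 Elm Street", "7 Oak Avenue"),
  target  = c(1, 1, 0)
)

# Tokenize each address into words, then compute the mean target per token
token_likelihoods <- addresses %>%
  unnest_tokens(word, address) %>%
  group_by(word) %>%
  summarise(mean_target = mean(target), .groups = "drop")

# Tokens shared by the target = 1 addresses ("elm", "street") end up with
# higher encodings than tokens that only occur in the target = 0 address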
Thank you for your time