Hey everyone!
I am working with a dataset that has a column "Column_1" containing strings like "Value1,Value2". Both Value1 and Value2 are levels of the same factor. What I would like to do is combine tokenizing with the hashing trick. However, when I split the two values into new columns using something as simple as mutate(), they end up as separate columns with different identities, so the shared levels are no longer treated as one variable. textrecipes provides tokenizing through step_tokenize(), but that unfortunately produces a "textrecipes_tokenlist" column, which step_dummy_hash() cannot use.
I could try and write the whole procedure on my own using base R, but I'm trying to find out whether a tidyverse approach is possible.
Here's a minimal reproducible example:
library(recipes)
library(textrecipes)
library(tidymodels)
# Sample data
df <- tibble(
id = 1:3,
colors = c("red,blue", "green,yellow", "red,green"),
outcome = factor(c(1, 0, 1))
)
# This doesn't work - tokenlist is incompatible with step_dummy_hash
recipe_failing <- recipe(outcome ~ ., data = df) |>
step_tokenize(colors, token = "regex", options = list(pattern = ",")) |>
step_dummy_hash(colors)
# The error only appears once the recipe is prepped:
prepped <- try(prep(recipe_failing))
# Error in step_dummy_hash() : Error in `step_dummy_hash()`:
# Caused by error in `prep()`:
# ✖ All columns selected for the step should be string, factor, or
# ordered.
# • 1 tokenlist variable found: `colors`
The issue is that step_tokenize() creates a special tokenlist object, but step_dummy_hash() can't process these: it expects regular character or factor data.
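A quick way to confirm this is to prep only the tokenizing step and inspect the resulting column. This is a minimal sketch, assuming the same sample data as above; the name `tokenized` is just for illustration.

```r
library(recipes)
library(textrecipes)

# Same sample data as in the example above
df <- tibble::tibble(
  id = 1:3,
  colors = c("red,blue", "green,yellow", "red,green"),
  outcome = factor(c(1, 0, 1))
)

# Prep and bake only the tokenizing step (no hashing yet)
tokenized <- recipe(outcome ~ ., data = df) |>
  step_tokenize(colors, token = "regex", options = list(pattern = ",")) |>
  prep() |>
  bake(new_data = NULL)

# Inspect what step_tokenize() left behind in `colors`
class(tokenized$colors)
inherits(tokenized$colors, "textrecipes_tokenlist")
is.character(tokenized$colors) || is.factor(tokenized$colors)
```

If the last line returns FALSE while the inherits() check returns TRUE, that matches the error message: the column is a tokenlist rather than the string or factor that step_dummy_hash() asks for.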
Thank you for any suggestions or ideas in advance!