Combining tokenizing and feature hashing

CatraMyBeloved · May 2, 2025, 8:17pm

Hey everyone!
I am working with a dataset that has a column "Column_1" containing strings such as "Value1,Value2". Both Value1 and Value2 are levels of the same factor. What I would like to do is combine tokenizing and the hashing trick. However, when I split the two values into new columns using something as simple as a mutate(), they end up in separate columns and are treated as different variables. textrecipes provides tokenizing through step_tokenize(), which unfortunately produces a "textrecipes_tokenlist" column, and that isn't usable by step_dummy_hash().

I could try and write the whole procedure on my own using base R, but I'm trying to find out whether a tidyverse approach is possible.

Here's a minimal reproducible example:

library(recipes)
library(textrecipes)
library(tidymodels)

# Sample data
df <- tibble(
  id = 1:3,
  colors = c("red,blue", "green,yellow", "red,green"),
  outcome = factor(c(1, 0, 1))
)

# This doesn't work - tokenlist is incompatible with step_dummy_hash
recipe_failing <- recipe(outcome ~ ., data = df) |>
  step_tokenize(colors, token = "regex", options = list(pattern = ",")) |>
  step_dummy_hash(colors)

# Preparing the recipe gives this error:
prepped <- try(prep(recipe_failing))
# Error in step_dummy_hash() : Error in `step_dummy_hash()`:
# Caused by error in `prep()`:
# ✖ All columns selected for the step should be string, factor, or
#   ordered.
# • 1 tokenlist variable found: `colors`

The issue is that step_tokenize() creates a special tokenlist object, but step_dummy_hash() can't process these - it expects regular character/factor data.

Thank you for any suggestions or ideas in advance!

Emilhvitfeldt · May 2, 2025, 8:45pm

You need to use step_texthash() instead of step_dummy_hash(). step_dummy_hash() takes factor and character columns as input and hashes each value as a whole, so "red,blue" would be treated as a single level. step_texthash() takes tokenized input and hashes each token individually, which is what you want:

library(recipes)
library(textrecipes)
library(tidymodels)

# Sample data
df <- tibble(
  id = 1:3,
  colors = c("red,blue", "green,yellow", "red,green"),
  outcome = factor(c(1, 0, 1))
)

# Tokenize on commas, then hash each token with step_texthash()
recipe_working <- recipe(outcome ~ ., data = df) |>
  step_tokenize(colors, token = "regex", options = list(pattern = ",")) |>
  step_texthash(colors)

prep(recipe_working) |>
  bake(new_data = NULL)
#> # A tibble: 3 × 1,026
#>      id outcome texthash_colors_0001 texthash_colors_0002 texthash_colors_0003
#>   <int> <fct>                  <int>                <int>                <int>
#> 1     1 1                          0                    0                    0
#> 2     2 0                          0                    0                    0
#> 3     3 1                          0                    0                    0
#> # ℹ 1,021 more variables: texthash_colors_0004 <int>,
#> #   texthash_colors_0005 <int>, texthash_colors_0006 <int>,
#> #   texthash_colors_0007 <int>, texthash_colors_0008 <int>,
#> #   texthash_colors_0009 <int>, texthash_colors_0010 <int>,
#> #   texthash_colors_0011 <int>, texthash_colors_0012 <int>,
#> #   texthash_colors_0013 <int>, texthash_colors_0014 <int>,
#> #   texthash_colors_0015 <int>, texthash_colors_0016 <int>, …

Created on 2025-05-02 with reprex v2.1.1
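
The default output above is hard to inspect because step_texthash() creates 1,024 hash columns. As a rough sketch of the difference between the two steps, the example below shrinks the hash space with num_terms and turns off signed hashing so the values are plain counts. The choice of num_terms = 8 and the object names (rec_tokens, rec_whole, and so on) are assumptions made for illustration only; with so few columns, hash collisions between colors are possible.

```r
library(recipes)
library(textrecipes)
library(tibble)

# Same sample data as above
df <- tibble(
  id = 1:3,
  colors = c("red,blue", "green,yellow", "red,green"),
  outcome = factor(c(1, 0, 1))
)

# Tokenize on commas, then hash each token into a small number of columns.
# num_terms = 8 and signed = FALSE are illustrative values, not recommendations.
rec_tokens <- recipe(outcome ~ ., data = df) |>
  step_tokenize(colors, token = "regex", options = list(pattern = ",")) |>
  step_texthash(colors, signed = FALSE, num_terms = 8)

baked_tokens <- prep(rec_tokens) |>
  bake(new_data = NULL)

# For contrast: step_dummy_hash() on the raw strings hashes each whole
# string (e.g. "red,blue") as a single level.
rec_whole <- recipe(outcome ~ ., data = df) |>
  step_dummy_hash(colors, signed = FALSE, num_terms = 8)

baked_whole <- prep(rec_whole) |>
  bake(new_data = NULL)

# Keep only the hash columns from each result for a side-by-side look
token_cols <- baked_tokens[startsWith(names(baked_tokens), "texthash_")]
whole_cols <- baked_whole[startsWith(names(baked_whole), "dummyhash_")]

print(token_cols)
print(whole_cols)
```

In practice num_terms is a tuning choice: fewer columns keep the data small but raise the chance that two distinct levels land in the same column.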

CatraMyBeloved · May 2, 2025, 8:47pm

Thank you very much! That works perfectly.
