textrecipes tuning multiple columns

john.smith · September 26, 2021, 9:14pm

Hi,

I am exploring the package textrecipes within the tidymodels ecosystem.

I wish to tune several options within the tokenization steps but am getting a bit stuck.

Let's say I have a data frame with two columns of reviews of burger places, from two different newspapers.

It looks like this:

I am trying to predict if a customer will go to the burger places after reading the reviews.

This is a made-up nonsense dataset, just for illustration.


burger_id   newspaper_1_review      newspaper_2_review              cust_go
1           This is a review        This is a second review         'Y'
2           This is a review        This is a second review         'N'
3           This is a review        This is a second review         'Y'
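
For anyone who wants to reproduce this, here is a minimal sketch of the toy data in base R. The name mydf matches the object used in the recipe further down; storing cust_go as a factor is an assumption, made because the model is a classifier.

```r
# Toy data only: three made-up rows mirroring the table above
mydf <- data.frame(
  burger_id = 1:3,
  newspaper_1_review = rep("This is a review", 3),
  newspaper_2_review = rep("This is a second review", 3),
  # Assumption: the outcome is stored as a factor for classification
  cust_go = factor(c("Y", "N", "Y")),
  stringsAsFactors = FALSE
)

str(mydf)
```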

I have set up my recipe and tidymodels workflow as below and was wondering how I can tune the tokenization of each newspaper's reviews separately.

I have put together a rough example below, which I have not been able to get running.


library(tidymodels)
library(tidyverse)
library(textrecipes)  # provides step_tokenize(), step_ngram(), step_tokenfilter(), step_tf()

xgb_rec <- recipe(cust_go ~ newspaper_1_review + newspaper_2_review, data = mydf) %>%

  # First newspaper to tune: each tune() gets its own id so the two
  # columns end up as separate entries in the grid
  step_tokenize(newspaper_1_review) %>%
  step_ngram(newspaper_1_review, num_tokens = tune("np1_num_tokens"), min_num_tokens = 1) %>%
  step_tokenfilter(newspaper_1_review, max_tokens = tune("np1_max_tokens"), min_times = 5) %>%
  step_tf(newspaper_1_review) %>%

  # Second newspaper to tune
  step_tokenize(newspaper_2_review) %>%
  step_ngram(newspaper_2_review, num_tokens = tune("np2_num_tokens"), min_num_tokens = 1) %>%
  step_tokenfilter(newspaper_2_review, max_tokens = tune("np2_max_tokens"), min_times = 5) %>%
  step_tf(newspaper_2_review)

# boilerplate
xgb_spec <-
  boost_tree(trees = 1300, min_n = 6, mtry = 15, learn_rate = 0.01) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# The workflow is built before the grid so the grid can read its tune() ids
xgb_wf <- workflow() %>%
  add_recipe(xgb_rec) %>%
  add_model(xgb_spec)

# This is the part I'm a bit all over the place with.
# Pull the four tune() ids out of the workflow and build the grid from them
# (older tidymodels versions used parameters(xgb_wf) here)
xgb_grid <- grid_max_entropy(
  extract_parameter_set_dials(xgb_wf),
  size = 10
)

ctrl <- control_grid(verbose = FALSE, save_pred = TRUE)

set.seed(345)
xgb_rs <- tune_grid(
  xgb_wf,
  resamples = train_fold,
  grid = xgb_grid,
  metrics = mset,
  control = ctrl
)
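
As far as I understand it, the usual way to tune the same argument separately in two steps is to pass a distinct id to each tune() call. The sketch below is a stripped-down check of that idea, not a full solution: it assumes a recent tidymodels release in which extract_parameter_set_dials() has a method for recipes, and the ids np1_max_tokens and np2_max_tokens are names made up for illustration.

```r
library(recipes)
library(textrecipes)
library(tune)   # tune() placeholder
library(dials)  # parameter objects behind the ids

# Toy data only, mirroring the two review columns
toy <- data.frame(
  newspaper_1_review = c("This is a review", "This is a review"),
  newspaper_2_review = c("This is a second review", "This is a second review"),
  cust_go = factor(c("Y", "N")),
  stringsAsFactors = FALSE
)

rec <- recipe(cust_go ~ newspaper_1_review + newspaper_2_review, data = toy) %>%
  step_tokenize(newspaper_1_review, newspaper_2_review) %>%
  # Same argument (max_tokens) in two steps, told apart by the tune() id
  step_tokenfilter(newspaper_1_review, max_tokens = tune("np1_max_tokens")) %>%
  step_tokenfilter(newspaper_2_review, max_tokens = tune("np2_max_tokens")) %>%
  step_tf(newspaper_1_review, newspaper_2_review)

# One row per tune() placeholder; the id column holds the custom names
params <- extract_parameter_set_dials(rec)
params$id
```

If the two ids show up as separate rows, a grid built from that parameter set should vary the two columns independently.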

Thank you for your time
