Hi,
I am exploring the textrecipes package within the tidymodels ecosystem.
I wish to tune several options within the tokenizing steps but am a bit stuck.
Let's say I have a data frame with two columns of reviews of burger places from two different newspapers.
I am trying to predict whether a customer will go to a burger place after reading the reviews.
This is a made-up nonsense dataset just for illustration. It looks like this:
burger_id  newspaper_1_review  newspaper_2_review       cust_go
1          This is a review    This is a second review  'Y'
2          This is a review    This is a second review  'N'
3          This is a review    This is a second review  'Y'
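In case it helps to reproduce this, here is a minimal sketch of how the toy data could be built. The name mydf is the one my recipe refers to; converting cust_go to a factor is my assumption about what a classification model needs, not something I have verified for every engine.

```r
library(tibble)

# Toy data mirroring the table above; the values are made up for illustration.
mydf <- tribble(
  ~burger_id, ~newspaper_1_review, ~newspaper_2_review,       ~cust_go,
  1,          "This is a review",  "This is a second review", "Y",
  2,          "This is a review",  "This is a second review", "N",
  3,          "This is a review",  "This is a second review", "Y"
)

# Assumption: the outcome should be a factor for a classification model.
mydf$cust_go <- factor(mydf$cust_go)
```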
I have set up my recipe and model as shown below, and was wondering how I can tune the tokenization of the two newspaper reviews separately.
I have put together the example below to show what I am after; I do not expect it to be right as written.
library(tidymodels)
library(tidyverse)
library(textrecipes)  # provides step_tokenize(), step_ngram(), step_tokenfilter(), step_tf()

xgb_rec <- recipe(cust_go ~ newspaper_1_review + newspaper_2_review, data = mydf) %>%
  # First newspaper to tune; each tune() gets its own id so the two columns stay separate
  step_tokenize(newspaper_1_review) %>%
  step_ngram(newspaper_1_review, num_tokens = tune("newspaper_1_review_num_tokens"), min_num_tokens = 1) %>%
  step_tokenfilter(newspaper_1_review, max_tokens = tune("newspaper_1_review_max_tokens"), min_times = 5) %>%
  step_tf(newspaper_1_review) %>%
  # Second newspaper to tune
  step_tokenize(newspaper_2_review) %>%
  step_ngram(newspaper_2_review, num_tokens = tune("newspaper_2_review_num_tokens"), min_num_tokens = 1) %>%
  step_tokenfilter(newspaper_2_review, max_tokens = tune("newspaper_2_review_max_tokens"), min_times = 5) %>%
  step_tf(newspaper_2_review)
# boilerplate
xgb_spec <-
  boost_tree(trees = 1300, min_n = 6, mtry = 15, learn_rate = 0.01) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
# This is the part I'm least sure about: the workflow is built first so that the
# grid can be derived from whatever parameters are marked with tune() in it
xgb_wf <- workflow() %>%
  add_recipe(xgb_rec) %>%
  add_model(xgb_spec)

xgb_grid <- xgb_wf %>%
  extract_parameter_set_dials() %>%
  grid_max_entropy(size = 10)
ctrl <- control_grid(verbose = FALSE, save_pred = TRUE)
set.seed(345)
xgb_rs <- tune_grid(
  xgb_wf,
  resamples = train_fold,
  grid = xgb_grid,
  metrics = mset,
  control = ctrl
)
Thank you for your time