Supervised Text Modelling Tidymodels - Can't convert <textrecipes_tokenlist....

Hi,

I am trying to follow along with the book Supervised Machine Learning for Text Analysis in R and am trying to build my first classifier. Basically, I want to predict whether a customer complaint is about credit or something else.

I have taken the data from the Consumer Financial Protection Bureau (CFPB) and reduced it to a subsample so it runs reasonably quickly.
When fitting the workflow to the resamples, I get the following error:

Slice013: preprocessor 1/1, model 1/1: Error: Can't convert <textrecipes_tokenlist> to .

Below is my code:


library(readr)
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(janitor)

# To speed things up we take 1% of the rows per date
# producing a reduced complaints dataset
complaints <- read_csv("~/data/complaints.csv.zip") %>% 
  clean_names() %>% 
  filter(!is.na(consumer_complaint_narrative)) %>% 
  group_by(date_received) %>% 
  sample_frac(size = 0.01) %>% 
  ungroup() 

# Create the classification label and select only two columns to keep it simple
complaints <- complaints %>% 
  mutate(tgt_class  = 
           case_when(
             str_detect(product, 'Credit|personal consumer') ~ "Credit",
             TRUE ~ "Other")
         ) %>% 
  select(date_received, consumer_complaint_narrative, consumer_disputed, tgt_class) %>% 
  na.omit()

# Just double check that it looks sensible
table(complaints$tgt_class)
#> Credit  Other 
#>   3442   3301

head(complaints)
#> # A tibble: 6 x 4
#>   date_received consumer_complaint_narrative          consumer_disput~ tgt_class
#>   <date>        <chr>                                 <chr>            <chr>    
#> 1 2015-03-19    "I wrote to XXXX, asking them to sto~ No               Other    
#> 2 2015-03-19    "In XX/XX/XXXX my wages that I earne~ Yes              Other    
#> 3 2015-03-20    "I sent a letter and have yet to rec~ No               Other    
#> 4 2015-03-20    "I have inquiry alerts through my ba~ No               Other    
#> 5 2015-03-21    "Equifax has changed my student loan~ No               Credit   
#> 6 2015-03-22    "I HAVE A FRAUD ALERT ON ALL MY CRED~ Yes              Credit

# MODELLING ---------------------------------------------------------------
set.seed(1)
comp_split <- initial_split(complaints, strata = tgt_class)
comp_train <- training(comp_split) 
comp_test <- testing(comp_split)

# Set up the resamples as time slices by month
set.seed(2)
complaints_slices <- sliding_period(
  comp_train,
  date_received,
  "month",
  lookback = Inf,
  assess_stop = 1,
  skip = 3,
  step = 1
)

# Now create a very simple text recipe
comp_rec <- recipe(tgt_class ~., data = comp_train) %>%
  step_tokenize(consumer_complaint_narrative) %>% # Tokenizes to words by default
  step_tokenfilter(consumer_complaint_narrative, max_tokens = 500) 

# Double check it tokenized
comp_rec %>% 
  prep() %>% 
  bake(new_data = NULL)

#> # A tibble: 5,058 x 4
#>    date_received consumer_complaint_narrative consumer_disputed tgt_class
#>    <date>                           <tknlist> <fct>             <fct>    
#>  1 2015-03-19                     [21 tokens] No                Other    
#>  2 2015-03-19                    [549 tokens] Yes               Other    
#>  3 2015-03-20                    [102 tokens] No                Other    
#>  4 2015-03-20                    [128 tokens] No                Other    
#>  5 2015-03-21                     [93 tokens] No                Credit   
#>  6 2015-03-22                    [251 tokens] Yes               Credit   
#>  7 2015-03-23                    [232 tokens] No                Credit   
#>  8 2015-03-23                     [25 tokens] No                Other    
#>  9 2015-03-24                     [63 tokens] No                Credit   
#> 10 2015-03-24                     [73 tokens] No                Other    
#> # ... with 5,048 more rows

# Specify the random forest model and run it on the resamples
rf_spec <- 
  rand_forest() %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

rf_wflow <-
  workflow() %>%             # bundle the preprocessing and the model together
  add_recipe(comp_rec) %>%   # the text recipe defined above
  add_model(rf_spec)         # the ranger random forest spec

rf_res <- 
  rf_wflow %>% 
  fit_resamples(
    resamples = complaints_slices, 
    metrics = metric_set(kap, roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE)
  ) 

#> x Slice01: preprocessor 1/1, model 1/1: Error: Can't convert <textrecipes_tokenlist...
#> x Slice02: preprocessor 1/1, model 1/1: Error: Can't convert <textrecipes_tokenlist...
#> x Slice03: preprocessor 1/1, model 1/1: Error: Can't convert <textrecipes_tokenlist...
#> x Slice04: preprocessor 1/1, model 1/1: Error: Can't convert <textrecipes_tokenlist...
#> x Slice05: preprocessor 1/1, model 1/1: Error: Can't convert <textrecipes_tokenlist...
#> x Slice06: preprocessor 1/1, model 1/1: Error: Can't convert <textrecipes_tokenlist...
#> ..........
#> Warning: All models failed. See the `.notes` column.

Created on 2021-03-31 by the reprex package (v1.0.0)
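
The full error behind the truncated slice messages shows up in the `.notes` column, as the final warning suggests (I believe the exact layout of `.notes` varies a little between tune versions):

# Look at the notes attached to the first resample for the complete error message
rf_res$.notes[[1]]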

Any help would be greatly appreciated.

It turns out I was missing a step in my recipe to convert the tokens themselves into numeric features, which is what ranger needs. In my case I wanted plain term frequencies, so I added step_tf(); the book chapter itself uses step_tfidf().

So for completeness:

complaints_rec <- recipe(tgt_class ~ ., data = comp_train) %>%
  step_tokenize(consumer_complaint_narrative) %>%
  step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
  step_tf(consumer_complaint_narrative) # this step converts the token list into numeric term-frequency columns
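
For reference, the book chapter's tf-idf version is the same recipe with `step_tfidf()` swapped in for `step_tf()` (sketched below on the same columns; the `complaints_rec_tfidf` name is just for illustration):

complaints_rec_tfidf <- recipe(tgt_class ~ ., data = comp_train) %>%
  step_tokenize(consumer_complaint_narrative) %>%
  step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
  step_tfidf(consumer_complaint_narrative) # weights each term by tf-idf instead of raw counts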
