I'm new to ML. Suppose I want to do sentiment analysis of, say, tweets. Using the blogs and tutorials at tidymodels.org, I know I want to tokenize the tweets, create a DFM (or DTM) with one tweet per row and every word in the training set as a column (aka feature), and then train the model on the DFM. So far, so good. None of the examples go as far as fitting the model to test data.
What trips me up is that the test set will of course not have the same words/features as the training set, so I can't fit the test set using the training-set model. predict(trained_model, test_set_dfm) complains that columns are missing. You might say to just create a DFM with all the words in both, but what happens when new data that introduces a new word comes in? Or what happens when, to shrink the DFM to a manageable size, I train on only the top-occurring words?
You are going to get data leakage if you create your document-feature matrix before you do the split. If you are working within the tidymodels framework, I recommend that you use the textrecipes package to create features from text. It ensures that the same calculations are performed during training and testing, so you avoid most of these issues.
The example below is simplistic on purpose, but it should steer you in the right direction.
library(tidymodels)
library(textrecipes)
data("tate_text", package = "modeldata")
tate_text <- tate_text |>
  select(medium, year) |>
  mutate(year = if_else(year > 2000, "2000s", "1900s"))
tate_text
#> # A tibble: 4,284 × 2
#> medium year
#> <fct> <chr>
#> 1 Video, monitor or projection, colour and sound (stereo) 1900s
#> 2 Etching on paper 1900s
#> 3 Etching on paper 1900s
#> 4 Etching on paper 1900s
#> 5 Oil paint on canvas 1900s
#> 6 Oil paint on canvas 1900s
#> 7 Acrylic paint on paper 1900s
#> 8 Woodcut on paper 1900s
#> 9 Oil paint and wax on canvas 1900s
#> 10 Print on paper 1900s
#> # ℹ 4,274 more rows
set.seed(1234)
tate_split <- initial_split(tate_text)
tate_train <- training(tate_split)
tate_test <- testing(tate_split)
rec <- recipe(year ~ medium, data = tate_train) |>
  step_tokenize(medium) |>
  step_tokenfilter(medium, max_tokens = 20) |>
  step_tf(medium)
lr_spec <- logistic_reg()
wf_spec <- workflow() |>
  add_recipe(rec) |>
  add_model(lr_spec)
wf_fit <- fit(wf_spec, data = tate_train)
predict(wf_fit, new_data = tate_train)
#> # A tibble: 3,213 × 1
#> .pred_class
#> <fct>
#> 1 1900s
#> 2 1900s
#> 3 1900s
#> 4 1900s
#> 5 2000s
#> 6 2000s
#> 7 2000s
#> 8 1900s
#> 9 1900s
#> 10 1900s
#> # ℹ 3,203 more rows
predict(wf_fit, new_data = tate_test)
#> # A tibble: 1,071 × 1
#> .pred_class
#> <fct>
#> 1 2000s
#> 2 1900s
#> 3 1900s
#> 4 2000s
#> 5 1900s
#> 6 1900s
#> 7 1900s
#> 8 1900s
#> 9 1900s
#> 10 1900s
#> # ℹ 1,061 more rows
Thank you for the response. I am using textrecipes and was following the example you cite, but I guess I'm unclear about some things.
If I understand you correctly, a one-token-per-feature approach, as this is, precludes presenting the model with data outside of the original data set. So, again talking about tweets, once I train my model, a new tweet that comes in containing any new words would not be predictable? Expanding on your example:
> tate_new = tibble(medium = "Finger paint on sofa", year = "2000s")
> predict(wf_fit, new_data = tate_new)
# A tibble: 1 × 1
.pred_class
<fct>
1 1900s
Warning message:
Novel levels found in column 'medium': 'Finger paint on sofa'. The levels have been removed, and values have been coerced to 'NA'.
The code you show obviously works, but since the recipe specifies modeling on tate_train it has no awareness of tate_test. Why doesn't that run into the data-leakage problem you cite?
Along those lines, how does step_tokenfilter() work? If it selects the top 20 tokens in the training set by frequency, all of those tokens are not guaranteed to appear in the test set, yet predict() on the test set doesn't throw an error. Why?
Sorry about that, that is an oversight on my end. You will need to make sure that your text column (in this case medium) is being passed in as a character variable, not a factor. Then it should work.
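Roughly, the fix is a one-line change before the split. A minimal sketch (assuming the rest of the example above is re-run unchanged after this conversion):
# Make the text column character so new text isn't rejected as an
# unknown factor level at prediction time.
tate_text <- tate_text |>
  mutate(medium = as.character(medium))

# ...redo the split, recipe, and fit exactly as above, then:
predict(wf_fit, new_data = tibble(medium = "Finger paint on sofa"))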
If I understand you correctly, a one-token-per-feature approach, as this is, precludes presenting the model with data outside of the original data set. So, again talking about tweets, once I train my model, a new tweet that comes in containing any new words would not be predictable? Expanding on your example:
Above was my mistake with the characters/factors issue. But to expand: these models work by looking at how often different words appear, then using that information to set the weights of the model. If a new word shows up in the testing data set, it is simply ignored, because the model has zero information about that word.
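To illustrate, here is a sketch (assuming medium was converted to character before fitting, as above): baking the recipe that was learned at fit() time on a brand-new document only fills in the tf_* columns that already exist, and unseen tokens contribute nothing.
# Apply the recipe exactly as it was estimated during fit().
# "Finger" and "sofa" were never seen in training, so they have no columns
# and are dropped; known tokens like "paint" and "on" fill the existing
# tf_medium_* columns.
wf_fit |>
  extract_recipe() |>
  bake(new_data = tibble(medium = "Finger paint on sofa", year = "2000s"))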
The code you show obviously works, but since the recipe specifies modeling on tate_train it has no awareness of tate_test. Why doesn't that run into the data-leakage problem you cite?
Data leakage is what happens when you include information about the testing data in model training. For this example, if you were to let the model know that "sofa" was a word it could encounter later, that might change how the model is fit, hence "leaking". (This specific recipe doesn't have many leakage opportunities, but the general principle still applies.)
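Concretely, a sketch using the objects from the example above: everything the recipe "learns" is estimated on tate_train alone, and applying it to tate_test only reuses those estimates, so nothing about the test set can influence them.
# Estimate (prep) the recipe on the training data only.
rec_prepped <- prep(rec, training = tate_train)

# Applying it to the test data reuses those estimates; which tokens were
# kept was decided without ever looking at tate_test.
test_features <- bake(rec_prepped, new_data = tate_test)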
Along those lines, how does step_tokenfilter() work? If it selects the top 20 tokens in the training set by frequency, all of those tokens are not guaranteed to appear in the test set, yet predict() on the test set doesn't throw an error. Why?
step_tokenfilter() works by counting the tokens in the training data set passed to it. With these arguments it finds the 20 most common tokens and then filters so that only those tokens are allowed to pass through. The kept tokens are not guaranteed to appear in the test set, and that is okay, because it is a filter.
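Continuing the sketch above: the baked test set gets exactly the same tf_medium_* columns that were chosen from the training data, and a kept token that doesn't appear in a given test document simply gets a count of 0 there.
# Same 20 tf_medium_* columns either way; they were fixed by the training data.
names(bake(rec_prepped, new_data = tate_train))
names(bake(rec_prepped, new_data = tate_test))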
More generally, you will have a hard time creating a model that works well on data that is drastically different from the data it was trained on.