Category: Entry Point into a Topic
Repo: GitHub - Ckrenzer/Making-An-NGram-Tokenizer: Walks through the process of creating an n-gram tokenizer.
Background
I am a big fan of Julia Silge's work. Her writings and tutorials not only show you how to be an advanced user of algorithms, they also provide references for those looking to further their understanding of them. In the second chapter of the book she co-authored with Emil Hvitfeldt, Supervised Machine Learning for Text Analysis in R, Julia writes in great detail about how to create a tokenizer for text data. She explains how tokenizers work, the difficulties of tokenizing different human languages, and, importantly, she shows us how to create our own tokenizers for our own use cases. In doing so, she touches upon one of the most important heuristics of the R language, and indeed programming more broadly: functions gain flexibility at the expense of execution speed. This is not always the case, but it holds much of the time. In Chapter 24 of Hadley Wickham's Advanced R, Improving Performance, you see echoes of this claim: the more error-checking and custom features your function has, the more time it spends ensuring customizability instead of computing the return value.
The goal, of course, is to have both--we want speed and customizability. Julia does a great job explaining the trade-offs inherent in the different tokenizer options out there, and she even provided us with code to build our own tokenizers. The 'hand-written' functions from the book break text into character and word tokens, but they do not provide a tokenizer for n-grams. In fact, I could not find anything online that writes out the code for an n-gram tokenizer--all the tutorials just showed me how to use tidytext::unnest_tokens(). I searched through the depths of the tidytext source code to find the unnest_tokens() implementation, to no avail--but it's in there somewhere! After reading her chapter, I wanted to build an n-gram tokenizer from the ground up, and I wanted it to be fast. This guide examines some clever ways to assemble n-grams that are viable competitors to mainstream alternatives on CRAN.
We will start by describing how to create an n-gram tokenizer, then write several implementations, and close by benchmarking the functions.
The Process
Any value of n in n-grams can be constructed from unigrams. In practical terms, this means placing each word of a document into its own element of a vector and then pasting n adjacent words together to make the n-gram. For example, with n = 2 on the sentence "My short and exquisite sentence", we should have {"My short", "short and", "and exquisite", "exquisite sentence"} after tokenizing. Think of the tokenizer as a sliding window that selects two (n) words at a time until every pair of adjacent words has been selected.
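To make this concrete, here is a minimal sketch of the idea on the example sentence, using nothing but stringr and base R subsetting (an illustration only, not one of the implementations benchmarked below):
# split the sentence into unigrams, then pair each word with its successor
words <- stringr::str_split("My short and exquisite sentence", "\\s+")[[1]]
stringr::str_c(words[-length(words)], words[-1], sep = " ")
# this should yield the four bigrams listed above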
One of the greatest bottlenecks in this approach is the step that creates the unigrams. If it can be bypassed (perhaps with a well-designed regular expression?), the implementations can be sped up further.
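As a rough sketch of that idea (bigrams only, and not one of the implementations benchmarked below), an overlapping-match regular expression can pull adjacent word pairs straight out of the string--the lookahead captures the second word of each pair without consuming it, so that word is still available to start the next match:
bigrams_regex <- function(text){
  # each match consumes a word plus the whitespace after it,
  # while the lookahead captures the following word without consuming it
  m <- stringr::str_match_all(text, "(\\S+)\\s+(?=(\\S+))")[[1]]
  stringr::str_c(m[, 2], m[, 3], sep = " ")
}
bigrams_regex("My short and exquisite sentence")
Whether this actually beats splitting into unigrams first is left untested here.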
Packages
Let's load in all the packages we'll be using:
if(!require(pacman)) install.packages("pacman")
# General Packages
pacman::p_load(stringr, dplyr, readr, data.table)
# Packages with n-gram tokenizers
pacman::p_load(tokenizers, tidytext)
# Benchmarking packages
pacman::p_load(microbenchmark)
Tokenizers
As mentioned above, the more restricted a function's output, the faster its execution speed--and the more complex the data structure it returns, the longer it takes to run. Often, we will want to return a data frame instead of a character vector. This will cost us dearly, but the flexibility gained from doing so is often well worth it.
Let's begin with a simple tokenizer and work our way up in complexity.
Heads & Tails
Often, we only need bigrams (n = 2). After splitting the text into unigrams, head() and tail() combine the words into bigrams quite nicely:
f1 <- function(words){
  # the unigrams
  words <- str_split(words, "\\s+")[[1]]
  return(str_c(head(words, -1), tail(words, -1), sep = " ", collapse = NULL))
}
- Split the string into unigrams.
- Paste the vectors together using head() and tail().
- Bigrams (n = 2) only.
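As a quick sanity check, f1() on the example sentence should reproduce the bigrams from The Process section:
f1("My short and exquisite sentence")
# expected: "My short" "short and" "and exquisite" "exquisite sentence"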
Nested Loops
For those of us who like to think about problems by tracing out each step, loops work wonders. Loops are not guaranteed to be fast in R, but this methodology translates nicely to other languages:
f2 <- function(words, n = 2){
  # the unigrams
  words <- str_split(words, "\\s+")[[1]]
  n <- n - 1
  # one element per n-gram
  vec <- character(length(words) - n)
  for(i in 1:length(vec)){
    # append the n words starting at position i
    for(j in i:(i + n)){
      vec[i] <- str_c(vec[i], words[j], sep = " ")
    }
  }
  # drop the leading space introduced by pasting onto an empty string
  vec <- str_remove(vec, "^\\s{1}")
  return(vec)
}
- Split the string into unigrams.
- Use nested loops to assemble pairings of n words.
- Works for any value of n.
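Because f2() takes n as an argument, trigrams come just as easily as bigrams:
f2("My short and exquisite sentence", n = 3)
# expected: "My short and" "short and exquisite" "and exquisite sentence"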
Shift Lists
data.table's shift() function provides us with a bona fide sliding window as described above. We can then use lapply() and do.call() to assemble the n-grams. Do you keep a shift list?
f3 <- function(words, n = 2){
  # the unigrams
  words <- str_split(words, "\\s+")[[1]]
  n <- n - 1
  # copies of the word vector shifted ahead by 0, 1, ..., n positions;
  # na.omit() trims the NA padding from the shifted copies
  word_list <- lapply(shift(words, n = 0:n, type = 'lead'), na.omit)
  # the shortest copy determines how many n-grams we can form
  mn <- min(lengths(word_list))
  grams <- do.call(paste, lapply(word_list, head, mn))
  return(grams)
}
- Use data.table::shift() to put the words into n lists.
- Use the length of the shortest list when calling head(), and assemble the word vector with paste() and do.call().
- Does not work for n = 1.
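To see why this works, look at what data.table::shift() returns for the example sentence when n = 2 (one copy shifted ahead by 0 positions and one shifted ahead by 1):
words <- str_split("My short and exquisite sentence", "\\s+")[[1]]
shift(words, n = 0:1, type = "lead")
# a list of two character vectors: the original words, and the words moved
# forward one position with an NA padding the end
f3("My short and exquisite sentence", n = 2)
# expected: "My short" "short and" "and exquisite" "exquisite sentence"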
Tibbles
Of course, we might want to tokenize in a magrittr pipeline. When this is the case, character vectors usually won't suffice. Returning more complex data types allows for greater flexibility, but we pay for it at run time. To drive this point home, the three functions above are adjusted to return a tibble, and f2_tibble_custom() is a modified version of f2() that allows tokenization of multiple documents at once, as well as control over the input column of text.
f1_tibble <- function(words){
  # the unigrams
  words <- str_split(words, "\\s+")[[1]]
  return(tibble(text = str_c(head(words, -1), tail(words, -1), sep = " ", collapse = NULL)))
}

f2_tibble <- function(words, n = 2){
  # the unigrams
  words <- str_split(words, "\\s+")[[1]]
  n <- n - 1
  vec <- character(length(words) - n)
  for(i in 1:length(vec)){
    for(j in i:(i + n)){
      vec[i] <- str_c(vec[i], words[j], sep = " ")
    }
  }
  vec <- str_remove(vec, "^\\s{1}")
  return(tibble(text = vec))
}

f3_tibble <- function(words, n = 2){
  # the unigrams
  words <- str_split(words, "\\s+")[[1]]
  n <- n - 1
  word_list <- lapply(shift(words, n = 0:n, type = 'lead'), na.omit)
  mn <- min(lengths(word_list))
  grams <- do.call(paste, lapply(word_list, head, mn))
  return(tibble(text = grams))
}

f2_tibble_custom <- function(text_df, key_column, text_column, n = 2){
  # the unigrams
  words <- str_split(text_df[[text_column]], "\\s+", simplify = FALSE)
  n <- n - 1
  # results will be added to this data frame
  ngram_df <- tibble(doc_id = character(0), text = character(0))
  # performing this operation for each 'key' in the data frame
  for(element in 1:length(words)){
    # the value in the key column identifies the document
    key_name <- text_df[[key_column]][element]
    # the n-grams are added to this vector
    vec <- character(length(words[[element]]) - n)
    for(i in 1:length(vec)){
      for(j in i:(i + n)){
        vec[i] <- str_c(vec[i], words[[element]][j], sep = " ")
      }
    }
    ngram <- tibble(doc_id = key_name, text = str_remove(vec, "^\\s{1}"))
    ngram_df <- bind_rows(ngram_df, ngram)
  }
  return(ngram_df)
}
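As a small usage sketch (on a made-up two-document tibble, not the benchmark data), f2_tibble_custom() keeps track of which document each bigram came from:
docs <- tibble(doc_id = c("a", "b"),
               text = c("My short and exquisite sentence",
                        "Another brief example sentence"))
f2_tibble_custom(text_df = docs, key_column = "doc_id", text_column = "text", n = 2)
# a tibble with a doc_id column and a text column holding each document's bigrams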
Benchmarking
Data
To test our functions, we need data! Let's use my favorite novel, The Count of Monte Cristo. The version used is the same as the one from the corpus package's Project Gutenberg API, except the notes at the end of the book have been removed. The data is stored in a text file to ensure reproducibility.
The Count is the closest thing the 19th century had to the Most Interesting Man In The World; if you have time to spare, give it a read!
# The Count of Monte Cristo by Alexandre Dumas (English Translation)
cristo_text <- readr::read_lines("https://raw.githubusercontent.com/Ckrenzer/Making-An-NGram-Tokenizer/main/The%20Count%20of%20Monte%20Cristo.txt")
# We want the data represented in a single string for tokenization
cristo_text <- str_c(cristo_text, collapse = " ")
cristo <- tibble(doc_id = "cristo", text = cristo_text)
Finally, let's put our functions to the test! They will be compared, at n = 2, to each other and to the official tokenizers found on CRAN (tokenizers for character vectors and tidytext for tibbles). This takes a while to run.
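Before timing anything, a quick check that the hand-rolled functions agree with one another on the example sentence (the CRAN functions apply their own normalization, such as lowercasing, which is part of why check = NULL is passed to microbenchmark below):
s <- "My short and exquisite sentence"
identical(f1(s), f2(s, n = 2))
identical(f1(s), f3(s, n = 2))
# both comparisons should return TRUE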
# These functions return character vectors
microbenchmark::microbenchmark(
  `Heads & Tails` = f1(words = cristo$text),
  `Nested Loops` = f2(words = cristo$text, n = 2),
  `Shift List` = f3(words = cristo$text, n = 2),
  `tokenizers` = tokenizers::tokenize_ngrams(x = cristo$text, n = 2, simplify = FALSE)[[1]],
  check = NULL,
  times = 25,
  unit = "s"
)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> Heads & Tails 0.2889884 0.3867171 0.4575909 0.4182707 0.5038394 0.7673346 25
#> Nested Loops 6.7113265 8.0643973 9.1981370 8.9747056 10.2201626 12.1695817 25
#> Shift List 0.4393635 0.5288969 0.7289757 0.6653517 0.8351631 1.5823200 25
#> tokenizers 0.3321225 0.4031210 0.4658819 0.4320680 0.5043797 0.8674276 25
# These functions return tibbles
microbenchmark::microbenchmark(
  `Heads & Tails` = f1_tibble(words = cristo$text),
  `Nested Loops` = f2_tibble(words = cristo$text, n = 2),
  `Shift List` = f3_tibble(words = cristo$text, n = 2),
  `Nested Loops Custom` = f2_tibble_custom(text_df = cristo,
                                           key_column = "doc_id",
                                           text_column = "text",
                                           n = 2),
  `tidytext` = tidytext::unnest_tokens(tbl = cristo,
                                       input = text,
                                       output = text,
                                       token = "ngrams",
                                       n = 2),
  check = NULL,
  times = 25,
  unit = "s"
)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> Heads & Tails 0.2997065 0.3412449 0.3957833 0.3633596 0.4006367 0.6814346 25
#> Nested Loops 7.0280510 7.3828038 7.7136532 7.6348447 7.8778614 9.7611280 25
#> Shift List 0.3780660 0.4803673 0.5579947 0.5167267 0.5703890 0.8564233 25
#> Nested Loops Custom 7.1850311 7.5984540 7.9887214 7.9386131 8.1067919 9.5205180 25
#> tidytext 0.4621718 0.5305908 0.5828535 0.5530582 0.6049297 0.9781302 25
Depending on the day, Heads & Tails runs faster than the equivalent code in tokenizers. We beat code written in C++ with a bit of clever writing! The tokenizers function requires only about half the memory of Heads & Tails and can be used with any value of n, but we did this entirely with stringr and base R!
Using a black box is much different from understanding an algorithm, but now you can at least see how tokenizers actually work!
This is a submission to the R Views Call for Documentation. For more information see rviews.rstudio.com.