With token = "ngrams", unnest_tokens uses the tokenizers package and its tokenize_ngrams function behind the scenes. The help page states:
These functions will strip all punctuation and normalize all whitespace to a single space character.
There is currently no option to change that behavior, so you need a different tokenizer.
The ngram package is one option, as it allows more customization:
library(ngram)
txt <- "A 40-year-old R&D guy"
ng <- ngram(txt, n = 2)
#> An ngram object with 3 2-grams
print(ng, output = "full")
#> R&D guy | 1
#> NULL {1} |
#> 40-year-old R&D | 1
#> guy {1} |
#> A 40-year-old | 1
#> R&D {1} |
# rev() puts the n-grams in the order you want
ng_string <- rev(get.ngrams(ng))
ng_string
#> [1] "A 40-year-old" "40-year-old R&D" "R&D guy"
You can also pass a function as the token argument. That function must accept a character vector and return a list of the same length. See ?unnest_tokens
Here is an example:
txt <- "A 40-year-old R&D guy"
d <- tibble::tibble(txt = txt)
# a function that takes a string and returns a list holding its n-grams
ngram_string <- function(txt, n) list(unname(rev(ngram::get.ngrams(ngram::ngram(txt, n = n)))))
ngram_string(txt, 2)
#> [[1]]
#> [1] "A 40-year-old" "40-year-old R&D" "R&D guy"
# you need to vectorize it over txt
ngram_string_vec <- Vectorize(ngram_string, vectorize.args = "txt", USE.NAMES = FALSE)
# it now works on a vector
ngram_string_vec(c(d$txt, d$txt), n = 2)
#> [[1]]
#> [1] "A 40-year-old" "40-year-old R&D" "R&D guy"
#> [[2]]
#> [1] "A 40-year-old" "40-year-old R&D" "R&D guy"
# it is ready to be used as a tokenizing function
tidytext::unnest_tokens(d, ngram, txt, token = ngram_string_vec, n = 2, to_lower = FALSE)
#> # A tibble: 3 x 1
#> ngram
#> <chr>
#> 1 A 40-year-old
#> 2 40-year-old R&D
#> 3 R&D guy
Thank you for your nice reply.
While applying your code, I realized that my example input was not representative.
My real input is a data frame, and I overlooked the fact that the number of rows changes after tokenization.
If the input is a data frame like the one below,
name <- c("John", "Edgar", "James" )
desc <- c("A 40-year-old R&D guy", "no valid information", "nothing")
hr_info <- data.frame(name, desc)
could you extend the code a little to get the output below?
John A 40-year-old
John 40-year-old R&D
John R&D guy
Edgar no valid
Edgar valid information
I think you now have all the tools, so you should try by yourself to deal with lists of different sizes.
By tweaking the n-gram results a little, in several steps, you should be able to get it working. Otherwise you could try a list column and unnest it afterwards.
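As a rough sketch of the "list of different sizes" step, here is one way to expand per-row results into one row per n-gram using base R only. The tokens list is a hand-written stand-in for whatever a tokenizer would return (it is not computed here), and the names tokens and long are just illustrative.

```r
hr_info <- data.frame(
  name = c("John", "Edgar", "James"),
  desc = c("A 40-year-old R&D guy", "no valid information", "nothing"),
  stringsAsFactors = FALSE
)

# Assumption: a hand-written stand-in for tokenizer output,
# one character vector per row, possibly of length 0
tokens <- list(
  c("A 40-year-old", "40-year-old R&D", "R&D guy"),
  c("no valid", "valid information"),
  character(0)
)

# Repeat each name once per n-gram; rows with no n-grams disappear
long <- data.frame(
  name  = rep(hr_info$name, times = lengths(tokens)),
  ngram = unlist(tokens),
  stringsAsFactors = FALSE
)
long
```

The key piece is rep() with lengths(): it keeps the name column aligned with the flattened n-grams even when every row yields a different number of them.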
I ran the code below but ended up with the error "unable to find an inherited method for function ‘ngram’ for signature ‘"factor"’", which I can't debug. That's why I asked again.
name <- c("John", "Edgar", "James" )
desc <- c("A 40-year-old R&D guy", "no valid information", "nothing")
hr_info <- data.frame(name, desc)
hr_info %>% tidytext::unnest_tokens(ngram2, name, token = ngram_string_vec, n = 2, to_lower = FALSE)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘ngram’ for signature ‘"factor"’
Oh, I see.
This is because data.frame converts strings to factors by default (in R versions before 4.0.0).
Use data.frame with stringsAsFactors = FALSE, or use tibble() from the tibble package (also available through dplyr in the tidyverse).
Your columns will then be character vectors instead of factors.
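To make the difference concrete, here is a small base R check. It sets stringsAsFactors explicitly in both calls, so it behaves the same regardless of the default in your R version; the names as_factor and as_char are only for illustration.

```r
name <- c("John", "Edgar", "James")
desc <- c("A 40-year-old R&D guy", "no valid information", "nothing")

# Same data, built with and without factor conversion
as_factor <- data.frame(name, desc, stringsAsFactors = TRUE)
as_char   <- data.frame(name, desc, stringsAsFactors = FALSE)

class(as_factor$desc)  # factor column: ngram() has no method for this
class(as_char$desc)    # character column: what the tokenizer expects

# An existing factor column can also be converted after the fact
as_factor$desc <- as.character(as_factor$desc)
```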
Your function works in the ideal case, but I see some problems.
I tried to validate your functions before asking more questions. I hope this will be the last one.
#1. Error when the text is a single word
When txt is a single word like "nothing", the script throws the error below:
Error in ngram::ngram(txt, n = n) : input 'str' has nwords=1 and n=2; must have nwords >= n
I could filter out single words before applying your nice functions, but could you make it skip (do nothing) in that case?
#2. Results are not shown in order
When txt = "1 2 3 4 5 6 7 8 9 10", the result shows as below in random sequence.
To check that the result is what I expect, it should be in FIFO (First In, First Out) order.
Could you have a look and make it display in the right order?
I think it is simple enough for you to use an if clause, or something similar, in a custom function to handle that case.
ngram may not deal very well with this. It is possible that the ngram function does not process the words in order, so the results are not sorted FIFO. You should dig into ngram, and maybe open a feature request for the package. You could also switch back to tidytext when the text has no punctuation characters.
You could also open an issue in tidytext to ask whether they could support optional punctuation removal.
I don't have all the answers for you, and I can't do it on your behalf. I think you now have all the tools.
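For what it is worth, the "if clause in a custom function" idea could look roughly like the sketch below. It is not the ngram package: ngram_ws is a hypothetical helper that simply splits on whitespace in base R, so punctuation such as "40-year-old" and "R&D" is kept, inputs with fewer than n words return an empty vector, and n-grams come out in reading order. Whether it drops into unnest_tokens as the token argument is an assumption you would need to check yourself.

```r
# Hypothetical helper: whitespace-only n-grams, in order of appearance
ngram_ws <- function(txt, n = 2) {
  lapply(as.character(txt), function(s) {
    words <- strsplit(trimws(s), "[[:space:]]+")[[1]]
    # Guard: too few words for an n-gram, so return nothing instead of failing
    if (length(words) < n) {
      return(character(0))
    }
    starts <- seq_len(length(words) - n + 1)
    vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
  })
}

ngram_ws(c("A 40-year-old R&D guy", "nothing"), n = 2)
ngram_ws("1 2 3 4 5 6 7 8 9 10", n = 2)
```

Because it returns one list element per input string, the result can be flattened with rep() and lengths() on the original rows if you prefer not to rely on unnest_tokens.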
Because of words like "40-year-old" in my real cases, I can't go back to default tidytext. But I don't have enough programming skill to write my own n-gram function, so I followed your last suggestion and created an issue on 'ngram'.
Even though I don't get ordered results, I can get more reliable results with your suggestions. Thank you!