unnest_tokens problem with keyword of "R&D"

YB.Kim · January 9, 2019, 5:19am

For example, I would like to split “A 40-year-old R&D guy” into “A 40-year-old”, “40-year-old R&D”, “R&D guy” ONLY by space character.

But when I use unnest_tokens(ngram, txt, token = "ngrams", n = 2),

The function automatically replaces & (ampersand) and - (dash) with a space, and the result is as shown below.

A 40

40 year

Year old

Old r

R D

D guy.

Please help me get word pairs that are split only on spaces.

cderv · January 9, 2019, 6:23am

unnest_tokens with token = "ngrams" uses the tokenizers package and its tokenize_ngrams function behind the scenes. The help page states:

These functions will strip all punctuation and normalize all whitespace to a single space character.

You have no option to configure that currently. You have to use another option.

The ngram package is one option, as it allows more customization:

txt <- "A 40-year-old R&D guy"

library(ngram)
ng <- ngram(txt, n = 2)
ng
#> An ngram object with 3 2-grams
print(ng, output = "full")
#> R&D guy | 1 
#> NULL {1} | 
#> 
#> 40-year-old R&D | 1 
#> guy {1} | 
#> 
#> A 40-year-old | 1 
#> R&D {1} |

# I use rev() to get the order you want
ng_string <- rev(get.ngrams(ng))
ng_string
#> [1] "A 40-year-old"   "40-year-old R&D" "R&D guy"

Created on 2019-01-09 by the reprex package (v0.2.1)

How to use it with `unnest_tokens`?

You can provide a function as the token argument. This function must work on a vector and return a list. See ?unnest_tokens.
Here is an example:

txt <- "A 40-year-old R&D guy"

d <- tibble::tibble(txt = txt)
# a function that takes a string and returns a list holding its n-grams
ngram_string <- function(txt, n) list(unname(rev(ngram::get.ngrams(ngram::ngram(txt, n = n)))))
ngram_string(txt, 2)
#> [[1]]
#> [1] "A 40-year-old"   "40-year-old R&D" "R&D guy"
# you need to vectorize it
ngram_string_vec <- Vectorize(ngram_string, vectorize.args = "txt", USE.NAMES = FALSE)
# it works on vector now
ngram_string_vec(c(d$txt, d$txt), n = 2)
#> [[1]]
#> [1] "A 40-year-old"   "40-year-old R&D" "R&D guy"        
#> 
#> [[2]]
#> [1] "A 40-year-old"   "40-year-old R&D" "R&D guy"

# it is ready to be applied as tokenizing function
tidytext::unnest_tokens(d, ngram, txt, token = ngram_string_vec, n = 2, to_lower = FALSE)
#> # A tibble: 3 x 1
#>   ngram          
#>   <chr>          
#> 1 A 40-year-old  
#> 2 40-year-old R&D
#> 3 R&D guy

Created on 2019-01-09 by the reprex package (v0.2.1)

YB.Kim · January 10, 2019, 1:54am

Thank you for your nice reply.
While applying your code, I realized that my example input was not representative.

My real input is a data frame, and I overlooked the fact that the number of rows changes after tokenization.

If the input is a data frame like the one below,
name <- c("John", "Edgar", "James")
desc <- c("A 40-year-old R&D guy", "no valid information", "nothing")
hr_info <- data.frame(name, desc)

could you extend the code a little to get the output below?
John A 40-year-old
John 40-year-old R&D
John R&D guy
Edgar no valid
Edgar valid information

cderv · January 10, 2019, 7:53am

I think you now have all the tools; you should try by yourself to deal with lists of different sizes.

By tweaking the ngram results a little over several steps, you should be able to get it working. Otherwise you could try a list column and unnest afterwards.
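The list-column idea can be sketched as follows. This is a minimal illustration in base R only, not the ngram-based approach from earlier in the thread: `space_bigrams` is a hypothetical helper name, and it assumes words are separated by single spaces.

```r
name <- c("John", "Edgar", "James")
desc <- c("A 40-year-old R&D guy", "no valid information", "nothing")
hr_info <- data.frame(name, desc, stringsAsFactors = FALSE)

# Hypothetical helper: bigrams of ONE string, split on spaces only.
space_bigrams <- function(x) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  # Fewer than two words (e.g. "nothing"): no bigram can be formed.
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# List "column": one character vector of bigrams per row of hr_info.
bigram_list <- lapply(hr_info$desc, space_bigrams)

# "Unnest" it: repeat each name once per bigram found in its description.
result <- data.frame(
  name  = rep(hr_info$name, lengths(bigram_list)),
  ngram = unlist(bigram_list),
  stringsAsFactors = FALSE
)
result
```

Rows whose description yields no bigram (here James) simply drop out, which matches the output requested above.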

YB.Kim · January 10, 2019, 8:11am

Hi, cderv.

I ran the code below but ended up with the error "unable to find an inherited method for function ‘ngram’ for signature ‘"factor"’", which I can't debug. That's why I asked a follow-up question.

name <- c("John", "Edgar", "James" )
desc <- c("A 40-year-old R&D guy", "no valid information", "nothing")
hr_info <- data.frame(name, desc)
hr_info %>% tidytext::unnest_tokens(ngram2, name, token = ngram_string_vec, n = 2, to_lower = FALSE)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘ngram’ for signature ‘"factor"’

cderv · January 10, 2019, 8:37am

Oh I see.
This is because data.frame converts strings to factors by default.
Use data.frame with stringsAsFactors = FALSE, or use data_frame from tibble and dplyr in the tidyverse.
You'll get your columns as character vectors instead of factors.
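A short sketch of the first suggestion, using the data from the question. In R 4.0.0 and later the default is already stringsAsFactors = FALSE, so the explicit argument mainly matters on older versions like the one in use here.

```r
name <- c("John", "Edgar", "James")
desc <- c("A 40-year-old R&D guy", "no valid information", "nothing")

# Explicitly keep strings as character vectors instead of factors.
hr_info <- data.frame(name, desc, stringsAsFactors = FALSE)

# Both columns should now report "character" rather than "factor".
sapply(hr_info, class)
```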

YB.Kim · January 11, 2019, 1:41am

Thank you for your persistent support.

Your function works in the ideal case, but I see some problems.
I tried to validate your functions before asking further questions. I hope this will be the last one.

#1. Error when the text is a single word
When txt is a single word like "nothing", the script shows the error below:
Error in ngram::ngram(txt, n = n) : input 'str' has nwords=1 and n=2; must have nwords >= n

I could filter out single words before applying your functions, but could you make it skip (do nothing) in that case?

#2. Results are not shown in order
When txt = "1 2 3 4 5 6 7 8 9 10", the result comes back in random order, as shown below.

# A tibble: 9 x 1
  ngram0
1 2 3
2 6 7
3 5 6
4 3 4
5 4 5
6 9 10
7 8 9
8 1 2
9 7 8

To check that the result is what I expect, it should be in FIFO (First In, First Out) order.
Could you have a look at making it display in the right order?

cderv · January 11, 2019, 6:55am

I think it is simple enough for you to use an if clause or something else in a custom function to handle this.

ngram may not deal very well with this... It is possible that the ngram function does not process the words in order, so the results are not sorted FIFO. You should dig into ngram and maybe open a feature request in the package. You could switch back to tidytext if there are no punctuation characters?
You could also open an issue in tidytext to see if they could support optional punctuation removal.
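As a rough sketch of the "if clause in a custom function" idea, here is a base R tokenizer that splits on spaces only, keeps the n-grams in reading order, and returns nothing for inputs that are too short. It does not use the ngram package; `space_ngrams` is a hypothetical name, and it assumes words are separated by single spaces.

```r
# Hypothetical helper: n-grams split on spaces only, in reading order.
# Takes a character vector and returns a list with one element per string.
space_ngrams <- function(txt, n = 2) {
  lapply(strsplit(txt, " ", fixed = TRUE), function(words) {
    # Guard: too few words for an n-gram, so return nothing instead of erroring.
    if (length(words) < n) return(character(0))
    # One n-gram per starting position, taken left to right.
    starts <- seq_len(length(words) - n + 1)
    vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
  })
}

space_ngrams(c("1 2 3 4 5 6 7 8 9 10", "nothing"), n = 2)
```

Because it takes a vector and returns a list, it has the shape that the token argument of unnest_tokens expects, so it could presumably be passed as token = space_ngrams with to_lower = FALSE as in the earlier example. That combination has not been verified here.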

I don't have all the answers for you, and I can't do it on your behalf. I think you now have all the tools.

YB.Kim · January 14, 2019, 2:45am

Thank you for your persistent answers.

Because of words like "40-year-old" in my real cases, I can't go back to default tidytext, and I don't have enough programming skills to write my own ngram function. So I followed your last suggestion and created an issue on 'ngram'.

Even though I don't get ordered results, I can get more reliable results with your suggestion. Thank you!

cderv · January 14, 2019, 8:32pm

For reference, I post your issue here:
