Creating a Matrix from Tidytext unnest tokens

john.smith · February 27, 2019, 2:23pm

Hi,

I am currently working my way through the book Text Mining with R and am at the tokenizing portion of the book. My question may appear a bit simplistic, but bear with me.

In the example below, we take a text column and tokenize it into bigrams (n-grams with n = 2). If I wanted to use something like this for classification, I would need to convert these tokens into a matrix of 1s and 0s indicating whether each row of my original column contains a given bigram (1 where it does, 0 where it does not). Does anyone know how to accomplish this?


library(janeaustenr)
library(tidyverse)
library(tidytext)
d <- tibble(txt = prideprejudice)

d %>%
  unnest_tokens(bigram, txt, token = "ngrams", n = 2)

mara · February 27, 2019, 2:46pm

Have you gotten to the "Converting to and from non-tidy formats" section yet?

https://www.tidytextmining.com/images/tidyflow-ch-5.png

Looks like you need to count() and then cast_dtm() or cast_dfm() depending on what you want.
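
A minimal sketch of that route, assuming each line of the text counts as one "document". The `line` id column and the `present` flag are names I made up for illustration, and `cast_dtm()` needs the tm package installed. Swapping `distinct()` plus `mutate()` for `count(line, bigram)` would give you counts instead of 1s and 0s.

```r
library(janeaustenr)
library(dplyr)
library(tidytext)

# One row per line of text; `line` is an assumed document id (not in the original post)
d <- tibble(txt = prideprejudice) %>%
  mutate(line = row_number())

bigram_flags <- d %>%
  unnest_tokens(bigram, txt, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%   # drop rows where a line produced no bigram
  distinct(line, bigram) %>%   # keep one row per (line, bigram) pair
  mutate(present = 1)          # 1 marks presence; absent pairs are implicit 0s

# Sparse document-term matrix: rows are lines, columns are bigrams
bigram_dtm <- bigram_flags %>%
  cast_dtm(line, bigram, present)
```

Lines with fewer than two words never make it into the matrix, so it can have fewer rows than `d`. If you need to line it up with labels for classification, match on the row names of `bigram_dtm`, which hold the `line` ids.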

john.smith · February 27, 2019, 3:11pm

Hi @mara

I got to that section, but for some reason I could not get it to work (my fault entirely :))
I found an excellent book here that covers it:

http://uc-r.github.io/creating-text-features

Thanks for your help
