Speed Up Textual Data Preprocessing and POS Tagging

xiaoni · February 28, 2020, 1:37pm

I need to preprocess textual data.

df <- data.frame(text=c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..","I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment."), id=c(1,2), stringsAsFactors = FALSE)

I used the following code, but it is slow: cleaning 73000 rows took 13 minutes.

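library("tm")
library("qdap") ## assumed source of replace_number(), replace_abbreviation(), replace_contraction(), replace_symbol()
vector_corpus <- VCorpus(VectorSource(df$text)) ## df$text is a character vector, so use VectorSource rather than DataframeSource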
cleanCorpus <- function(vector_corpus) {
  cleaned_sqc.tmp <- tm_map(vector_corpus, content_transformer(tolower)) ## Transform to lower case
  cleaned_sqc.tmp <- replaceWord(cleaned_sqc.tmp, "\\\\n", "") ## Replace Word (replaceWord() is a helper whose definition is not shown here)
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, content_transformer(replace_number))
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, content_transformer(replace_abbreviation))
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, content_transformer(replace_contraction))
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, content_transformer(replace_symbol))
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, removeNumbers) ## Remove Numbers
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, removePunctuation) ## Remove Punctuation
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, removeWords, stopwords("en")) ## Remove Stopwords
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, stripWhitespace) ## Strip whitespace
  return(cleaned_sqc.tmp)
}

cleaned <- cleanCorpus(vector_corpus)

Can this still be sped up?

Also, I need to POS-tag this text. For 73000 rows, that took 30 minutes.

library("textreg")
library("rJava")
library("RDRPOSTagger")
library("tokenizers")
cleaned_tm <- convert.tm.to.character(cleaned)
sentences <- tokenize_sentences(cleaned_tm, simplify = TRUE)
print("POS Tagging Started")
start <- Sys.time()
unipostagger <- rdr_model(language = "English", annotation = "UniversalPOS")
unipostags <- rdr_pos(unipostagger, sentences, doc_id = names(cleaned_tm))

I also tried openNLP, but it returns this error for 29000 rows:

R on MacOS Error: vector memory exhausted (limit reached?)

I am using macOS, but the solution provided above does not work for me.

nirgrahamuk · February 28, 2020, 6:31pm

To speed up the cleaning, you probably want to understand which functions shorten the text and which make it longer. For example, removing stop words should happen early, so that fewer words are left to lowercase, and replacing numbers with words should happen as late as possible.
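
One way to find out which steps shrink the text and which ones cost the most time is to run them one at a time and record both. The snippet below is only a rough sketch in base R: the `steps` list, the tiny stop-word list, and `profile_steps()` are made-up stand-ins for illustration, not the tm/qdap functions from the question, and the input is a plain character vector rather than a corpus.

```r
# Toy input standing in for df$text (assumption: a plain character vector).
texts <- c(
  "Lately, I haven't been able to view my Online Payment Card.",
  "I've been using this app for almost 2 years without any problems."
)

# Hypothetical stand-ins for the cleaning steps, written with base R only.
# The stop-word list is a tiny illustrative one, not tm's stopwords("en").
toy_stopwords <- c("i", "my", "to", "been")
stop_pattern <- paste0("\\b(", paste(toy_stopwords, collapse = "|"), ")\\b")

steps <- list(
  lowercase        = tolower,
  remove_numbers   = function(x) gsub("[0-9]+", "", x),
  remove_punct     = function(x) gsub("[[:punct:]]+", "", x),
  remove_stopwords = function(x) gsub(stop_pattern, "", x),
  strip_space      = function(x) trimws(gsub("\\s+", " ", x))
)

# Apply the steps in order; for each one record the elapsed time and the
# total number of characters before and after it ran.
profile_steps <- function(x, steps) {
  rows <- vector("list", length(steps))
  for (i in seq_along(steps)) {
    before <- sum(nchar(x))
    elapsed <- system.time(x <- steps[[i]](x))[["elapsed"]]
    rows[[i]] <- data.frame(
      step = names(steps)[i],
      seconds = elapsed,
      chars_before = before,
      chars_after = sum(nchar(x)),
      stringsAsFactors = FALSE
    )
  }
  list(text = x, profile = do.call(rbind, rows))
}

result <- profile_steps(texts, steps)
print(result$profile)
```

On the real 73000 rows, the `seconds` column should show where the time goes, and the `chars_before`/`chars_after` columns show which steps are worth moving earlier. One caveat when reordering: as far as I know, tm's `removeWords` matches case-sensitively and `stopwords("en")` is all lowercase, so running stop-word removal before lowercasing would miss capitalized words such as "I" or "It's".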
