I need to preprocess textual data.
df <- data.frame(text=c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..","I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment."), id=c(1,2), stringsAsFactors = FALSE)
I used the following code, but it is slow: for 73,000 rows it takes about 13 minutes to clean up.
library("tm")
vector_corpus <- VCorpus(DataframeSource(df$text))
cleanCorpus <- function(vector_corpus) {
  cleaned_sqc.tmp <- tm_map(vector_corpus, content_transformer(tolower))                              ## transform to lower case
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, content_transformer(function(x) gsub("\\\\n", " ", x)))  ## replace literal "\n" escapes with a space
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, content_transformer(replace_number))                     ## spell out numbers
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, content_transformer(replace_abbreviation))               ## expand abbreviations
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, content_transformer(replace_contraction))                ## expand contractions
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, content_transformer(replace_symbol))                     ## replace symbols with words
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, removeNumbers)                                           ## remove numbers
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, removePunctuation)                                       ## remove punctuation
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, removeWords, stopwords("en"))                            ## remove stopwords
  cleaned_sqc.tmp <- tm_map(cleaned_sqc.tmp, stripWhitespace)                                         ## strip whitespace
  return(cleaned_sqc.tmp)
}
cleaned <- cleanCorpus(vector_corpus)

Can this still be sped up?
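One idea I have been considering is to skip tm_map() entirely, clean the raw character vector with vectorized string functions, and only build a corpus at the end if I still need one. This is just a rough, untested sketch of what I mean (it assumes the stringi and textclean packages, and cleanTextVector is a name I made up):

library("tm")
library("stringi")
library("textclean")

## hypothetical vectorized variant: works on df$text directly instead of a VCorpus
cleanTextVector <- function(x) {
  x <- stri_trans_tolower(x)                                    ## lower case
  x <- stri_replace_all_fixed(x, "\\n", " ")                    ## drop literal "\n" escapes
  x <- replace_number(x)                                        ## textclean: spell out numbers
  x <- replace_abbreviation(x)                                  ## textclean: expand abbreviations
  x <- replace_contraction(x)                                   ## textclean: expand contractions
  x <- replace_symbol(x)                                        ## textclean: replace symbols with words
  x <- stri_replace_all_regex(x, "[[:digit:]]+", "")            ## remove numbers
  x <- stri_replace_all_regex(x, "[[:punct:]]+", " ")           ## remove punctuation
  x <- removeWords(x, stopwords("en"))                          ## tm's removeWords() also works on plain character vectors
  x <- stri_trim_both(stri_replace_all_regex(x, "\\s+", " "))   ## collapse whitespace
  x
}

cleaned_vec <- cleanTextVector(df$text)

I am not sure how much this would save, since the replace_*() steps are probably the slow part either way, but at least it avoids the per-document overhead of tm_map().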
Also, I need to POS-tag this text. For 73,000 rows it takes about 30 minutes:
library("textreg")
library("rJava")
library("RDRPOSTagger")
library("tokenizers")
cleaned_tm <- convert.tm.to.character(cleaned)
sentences <- tokenize_sentences(cleaned_tm, simplify = TRUE)
print("POS Tagging Started")
start <- Sys.time()
unipostagger <- rdr_model(language = "English", annotation = "UniversalPOS")
unipostags <- rdr_pos(unipostagger, sentences, doc_id=names(cleaned_tm))
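The only speedup I have thought of so far is to split the documents into chunks and tag the chunks in parallel, building a fresh tagger inside each worker so the underlying Java object is not shared across forked processes. This is only a sketch of the idea (the chunking, the core count, and the assumption that each rdr_pos() call returns a data frame are my own guesses, not anything from the RDRPOSTagger docs):

library("parallel")
library("RDRPOSTagger")

n_cores <- 4                                                      ## guess; adjust to the machine
chunks  <- split(sentences, cut(seq_along(sentences), n_cores, labels = FALSE))

unipostags_list <- mclapply(chunks, function(chunk) {
  ## build a tagger per worker rather than sharing unipostagger across forks
  tagger <- rdr_model(language = "English", annotation = "UniversalPOS")
  rdr_pos(tagger, chunk)
}, mc.cores = n_cores)

unipostags_par <- do.call(rbind, unipostags_list)                 ## assumes each chunk result is a data frame

Since I am on macOS, mclapply() forking should be available, but I have not verified whether RDRPOSTagger behaves well inside forked sessions.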
I also tried openNLP, but for 29,000 rows it returns this error:

Error: vector memory exhausted (limit reached?)

I am using macOS, but the solutions I have found for that error are not working for me.