Text mining: R limit on length of character variables?

martinocrippa · September 5, 2019, 9:00am

Hi All,

For a text mining task I need to extract topics from some emails. The corpus of documents comes from HTML code, and the data are stored in a Cloudera Big Data environment. The problem arises when I import the HTML field into R: R truncates the string column, so I can only read part of each document's text.

Is there a length threshold for character variables in R? Is there a length threshold in RStudio? Is there a way to change this threshold?

Alternatively, I could parse the HTML with Big Data environment components such as Hive or Spark and import only the term-document matrix into R for analysis, but parsing text without R is tricky for me and would take a long time.

Can anyone help me?
Thanks in advance,
have a nice day

MC

DavoWW · September 5, 2019, 9:40am

I can make VERY long single strings in code without any problem:

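library(stringr)
aa <- rep("ABC", times=100000)    # character vector of 100,000 copies of "ABC"
aa <- str_c(aa, collapse = "")    # join them into one single string
aa                                # prints the whole string
length(aa)                        # 1: it is a single string
nchar(aa)                         # 300000 characters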

This suggests to me the problem is with the import of the HTML.
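
For reference, a single string in R can hold up to 2^31 - 1 bytes, so the language itself is unlikely to be the limit. A rough way to check where the text gets cut is to look at the lengths R actually stores after the import. This is only a sketch: the data frame `emails` and the column `html_body` are hypothetical stand-ins for the imported table, and the idea that a database driver caps the string column is an assumption, since the thread doesn't say how the import is done.

```r
# Hypothetical stand-in for the imported table; in practice this would be
# the data frame returned by the import from the Cloudera environment.
emails <- data.frame(
  id = 1:3,
  html_body = c(
    strrep("<p>short</p>", 10),
    strrep("<p>long document</p>", 5000),
    strrep("<p>another long document</p>", 8000)
  ),
  stringsAsFactors = FALSE
)

# Number of characters actually stored in each value
body_lengths <- nchar(emails$html_body)
summary(body_lengths)

# If the import truncated the column, many rows tend to pile up at exactly
# the same maximum length (assumption: a cap applied during the import).
at_max <- sum(body_lengths == max(body_lengths))
at_max

# Printing can abbreviate long values, so look at the end of one explicitly
substring(emails$html_body[2], body_lengths[2] - 20, body_lengths[2])
```

If `nchar()` reports the full lengths and only the printed output or the data viewer looks short, the data are intact and it is just the display. If many rows stop at the same length, the cut most likely happens before the data reach R.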

HTH

andresrcs · September 5, 2019, 11:56am

Hi

We don't really have enough info to help you out. Could you ask this with a minimal REPRoducible EXample (reprex)? A reprex makes it much easier for others to understand your issue and figure out how to help.

If you've never heard of a reprex before, you might want to start by reading this FAQ:

FAQ: How to do a minimal reproducible example (reprex) for beginners

