For a text mining project I need to extract topics from some emails. The documents in my corpus come from HTML code, and the data are stored in a Cloudera Big Data environment. The problem arises when I import the field containing the HTML code into R: R truncates the string column, so I can read only part of each document's text.
Is there a length limit for character variables in R? Is there a length limit in RStudio? Is there a way to change this limit?
Alternatively, I could parse the HTML with a Big Data environment component such as Hive or Spark and import only the term-document matrix into R for analysis, but parsing the text without R is tricky for me and would take a long time.
Can anyone help me?
Thanks in advance.
Have a nice day.
We don't really have enough info to help you out. Could you ask this with a minimal REPRoducible EXample (reprex)? A reprex makes it much easier for others to understand your issue and figure out how to help.
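For this kind of problem, a reprex could look roughly like the sketch below. The `html` string and the `emails` data frame are made-up stand-ins (I don't know your actual data or how you import it from the cluster). The idea is to check with `nchar()` how many characters R actually holds, since what gets printed or shown in a viewer can be shorter than what is stored. As far as I know, a single character string in R can be up to 2^31 - 1 bytes, so if `nchar()` on your imported column comes back shorter than expected, the truncation may be happening before the data reaches R; that is a guess to test, not a diagnosis.

```r
# Hypothetical stand-in for one HTML email body; real data would come from the cluster.
html <- paste0("<html><body><p>", strrep("lorem ipsum ", 5000), "</p></body></html>")

# nchar() reports the length actually stored in R, independent of how it is displayed.
n_stored <- nchar(html)
print(n_stored)

# Put the string in a data frame column, as an import would, and check the length again.
emails <- data.frame(id = 1L, body = html, stringsAsFactors = FALSE)
n_in_df <- nchar(emails$body)
print(n_in_df)

# Inspect only the start and the end of the string instead of printing all of it.
print(substr(html, 1, 40))
print(substr(html, n_stored - 39, n_stored))
```

If you run the same `nchar()` check on the column you imported and post the result together with the code you used for the import, it becomes much easier to see where the text gets cut.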
If you've never heard of a reprex before, you might want to start by reading this FAQ: