I'm cleaning my corpus, but even after trying the code below, the ellipses (three dots, ...) are still there.
trump_text <- gsub("[[:punct:]]", "", trump_text)
trump_text <- tm_map(trump_text, removePunctuation)
Any solution?
Hi @natu,
Try this:
text <- "Here is some text...and here is some more"
gsub("[[:punct:]]+", '', text)
#> [1] "Here is some textand here is some more"
Created on 2020-02-13 by the reprex package (v0.3.0)
library(stringr)
text1 <- "Here is a three-dot ellipsis: ..."
text2 <- "Here is an ellipsis character: \u2026"
str_replace(text1, "\\.\\.\\.","")
#> [1] "Here is a three-dot ellipsis: "
str_replace(text2, "\u2026","")
#> [1] "Here is an ellipsis character: "
Created on 2020-02-13 by the reprex package (v0.3.0)
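Since the original question works on a tm corpus rather than a plain character vector, here is a minimal sketch of applying the same two patterns through tm_map(). The sample documents and the helper name strip_ellipsis are hypothetical, and it assumes the tm package is installed.

```r
library(tm)

# Hypothetical stand-in for the real corpus
docs <- c("Make it great...", "Believe me\u2026 it works")
corp <- VCorpus(VectorSource(docs))

# Hypothetical helper: drop a run of three periods or the single
# ellipsis character (U+2026)
strip_ellipsis <- function(x) gsub("\\.{3}|\u2026", "", x)

# content_transformer() wraps a plain function so tm_map() can
# apply it to each document and keep the corpus structure intact
cleaned <- tm_map(corp, content_transformer(strip_ellipsis))

content(cleaned[[1]])
content(cleaned[[2]])
```

Calling gsub() directly on the corpus object, as in the question, operates on a coerced version of the corpus rather than on each document, which is why the wrapper is used here.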
Tons of Thanks....problem got solved
I wouldn't be surprised if technocrat guessed the underlying problem. Messaging software may "autocorrect" three periods to a single ellipsis character.
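One way to check which form is actually in the text is to test for each separately. This is a small base R sketch; the strings and variable names are made up for illustration.

```r
# Hypothetical sample strings that look alike on screen
typed         <- "Wait for it..."
autocorrected <- "Wait for it\u2026"
samples <- c(typed, autocorrected)

# fixed = TRUE matches literally, so "..." means three periods here,
# not "any three characters"
has_three_dots    <- grepl("...", samples, fixed = TRUE)
has_ellipsis_char <- grepl("\u2026", samples, fixed = TRUE)

# Remove either form in one pass
cleaned <- gsub("\\.{3}|\u2026", "", samples)

has_three_dots
has_ellipsis_char
cleaned
```

If has_ellipsis_char comes back TRUE for your data, a pattern that only targets periods will never match it.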
I suggest the stringi package.
library(stringi)
ex <- "Here is an 'ellipsis' character-byte: …"
punct_classes <- "[[:Pd:][:Ps:][:Pe:][:Pc:][:Po:][:Pi:][:Pf:]]"
stri_replace_all_regex(ex, punct_classes, "")
# [1] "Here is an ellipsis characterbyte "
Those P-classes cover all punctuation characters, as defined by the Unicode standard. Check out the stringi documentation to see what each one means.
That same doc page warns against using [:punct:].
POSIX Character Classes
Avoid using POSIX character classes, e.g., [:punct:]. The ICU User Guide (see below) states that in general they are not well-defined, so you may end up with something different than you expect.
In particular, in POSIX-like regex engines, [:punct:] stands for the character class corresponding to the ispunct() classification function (check out man 3 ispunct on UNIX-like systems). According to ISO/IEC 9899:1990 (ISO C90), the ispunct() function tests for any printing character except for space or a character for which isalnum() is true. However, in a POSIX setting, the details of what characters belong into which class depend on the current locale. So the [:punct:] class does not lead to a portable code (again, in POSIX-like regex engines).
Therefore, a POSIX flavor of [:punct:] is more like [\p{P}\p{S}] in ICU. You have been warned.
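To make the difference between \p{P} and \p{S} concrete, here is a short sketch with stringi. The sample string is invented for illustration, and it assumes the stringi package is installed.

```r
library(stringi)

# Hypothetical sample containing punctuation (: … ( . )) and symbols ($ +)
x <- "Cost: $5 + tax\u2026 (approx.)"

# Unicode punctuation only: symbols such as $ and + are left alone
only_punct <- stri_replace_all_regex(x, "\\p{P}", "")

# Punctuation plus symbols, the ICU class the quoted documentation
# describes as closer to a POSIX-style [:punct:]
punct_and_symbols <- stri_replace_all_regex(x, "[\\p{P}\\p{S}]", "")

only_punct
#> [1] "Cost $5 + tax approx"
punct_and_symbols
#> [1] "Cost 5  tax approx"
```

Which class to use depends on whether characters like $ and + should survive the cleaning step.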