How to remove ellipses (...)

natu · February 13, 2020, 10:48pm

I'm cleaning my corpus, but even after trying the code below, the ellipses (three dots ...) are still there.

trump_text <- gsub("[[:punct:]]", "",trump_text)
trump_text <- tm_map(trump_text, removePunctuation)

Is there a solution?

mattwarkentin · February 13, 2020, 11:06pm

Try this:

text <- "Here is some text...and here is some more"

gsub("[[:punct:]]+", '', text)
#> [1] "Here is some textand here is some more"

Created on 2020-02-13 by the reprex package (v0.3.0)

technocrat · February 13, 2020, 11:08pm

library(stringr)
text1 <- "Here is a three-dot ellipsis: ..."
text2 <- "Here is an ellipsis character: \u2026"
# Three literal periods (each dot must be escaped in a regex)
str_replace(text1, "\\.\\.\\.", "")
#> [1] "Here is a three-dot ellipsis: "
# The single-character Unicode ellipsis, U+2026
str_replace(text2, "\u2026", "")
#> [1] "Here is an ellipsis character: "

Created on 2020-02-13 by the reprex package (v0.3.0)
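
A corpus may contain both forms, so one pattern can cover the three-dot sequence and the single ellipsis character together. This is a minimal base-R sketch, not part of the original answer: `remove_ellipses` is a hypothetical helper name, and treating any run of three or more periods as an ellipsis is an assumption.

```r
# Hypothetical helper: strip both ASCII "..." runs and the Unicode
# ellipsis character (U+2026) from a character vector.
remove_ellipses <- function(x) {
  # "\\.{3,}" matches three or more literal periods;
  # "\u2026" is the single-character horizontal ellipsis.
  gsub("\\.{3,}|\u2026", "", x)
}

text1 <- "Here is a three-dot ellipsis: ..."
text2 <- "Here is an ellipsis character: \u2026"
cleaned <- remove_ellipses(c(text1, text2))
print(cleaned)
```

For a tm corpus, a function like this would probably need to be wrapped in `content_transformer()` before being passed to `tm_map()`.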

natu · February 14, 2020, 12:07am

Tons of thanks, the problem is solved.

nwerth · February 14, 2020, 2:54pm

I wouldn't be surprised if technocrat guessed the underlying problem. Messaging software may "autocorrect" three periods to a single ellipsis character.
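
One way to check which form a text actually contains is to search for each literally. This is a rough base-R sketch with made-up sample strings; the variable names are illustrative only.

```r
# Sample input: one ASCII three-dot run, one U+2026 character, one clean string.
x <- c("three dots...", "one character\u2026", "no ellipsis")

# fixed = TRUE matches the pattern literally, without regex interpretation,
# so "..." means three periods rather than "any three characters".
has_dots <- grepl("...", x, fixed = TRUE)
has_char <- grepl("\u2026", x, fixed = TRUE)

print(data.frame(text = x, has_dots, has_char))
```

If `has_char` is `TRUE` anywhere, a pattern that only targets literal periods will leave those ellipses in place.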

I suggest the stringi package.

library(stringi)
ex <- "Here is an 'ellipsis' character-byte: …"
punct_classes <- "[[:Pd:][:Ps:][:Pe:][:Pc:][:Po:][:Pi:][:Pf:]]"
stri_replace_all_regex(ex, punct_classes, "")
# [1] "Here is an ellipsis characterbyte "

Those P-classes cover all punctuation characters, as defined by the Unicode standard. Check out the stringi documentation to see what each one means.

That same doc page warns against using [:punct:].

POSIX Character Classes

Avoid using POSIX character classes, e.g., [:punct:] . The ICU User Guide (see below) states that in general they are not well-defined, so you may end up with something different than you expect.

In particular, in POSIX-like regex engines, [:punct:] stands for the character class corresponding to the ispunct() classification function (check out man 3 ispunct on UNIX-like systems). According to ISO/IEC 9899:1990 (ISO C90), the ispunct() function tests for any printing character except for space or a character for which isalnum() is true. However, in a POSIX setting, the details of what characters belong into which class depend on the current locale. So the [:punct:] class does not lead to a portable code (again, in POSIX-like regex engines).

Therefore, a POSIX flavor of [:punct:] is more like [\p{P}\p{S}] in ICU. You have been warned.
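
To see the difference between the punctuation and symbol categories mentioned in that warning, here is a small stringi sketch. The sample string is made up, and `\p{P}` is used as shorthand for the Unicode punctuation general category.

```r
library(stringi)

x <- "Cost: $5 + tax\u2026"

# \p{P} matches Unicode punctuation only (here, the colon and the ellipsis).
only_punct <- stri_replace_all_regex(x, "\\p{P}", "")

# Adding \p{S} also removes symbols such as "$" and "+".
punct_and_symbols <- stri_replace_all_regex(x, "[\\p{P}\\p{S}]", "")

print(only_punct)
print(punct_and_symbols)
```

Which of the two is appropriate depends on whether currency signs and math operators should survive the cleaning step.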
