I have a dataframe of tweets that I want to tokenize using tidytext. Many of the tweets contain emojis in the text field, and many of those emojis have no space between them and the adjacent emojis/text, which makes tokenizing difficult.
example <- "My priorities Saftey First\U0001f1fa\U0001f1f8\U0001f64f What were yours?"
I would like to use str_replace_all (or another option) to add a space before each emoji, as below:
"My priorities Saftey First \U0001f1fa \U0001f1f8 \U0001f64f What were yours?"
I have tried using the following but get an error:
str_replace_all(example, "\\U", " \\U")
Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE)
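As far as I can tell, the problem is that \U0001f1fa in the string literal becomes a single character once R parses it, so there is no literal "\U" text left to match, and \U is also not a recognized escape in the ICU regex engine that stringr/stringi use (which is what U_REGEX_BAD_ESCAPE_SEQUENCE is complaining about). A quick check of that assumption (a sketch):
nchar("\U0001f1fa")
#> [1] 1
grepl("\\U", example, fixed = TRUE)
#> [1] FALSE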
Working off of this example, I also tried the approach below, but it did not seem to alter the text.
To identify the \U characters, I assume they are non-ASCII. I split the string into single characters, identify the non-ASCII ones, modify them, and then paste everything back into the original form. I assume there is a better way to do this.
library(tidyverse)
example <- "My priorities Saftey First\U0001f1fa\U0001f1f8\U0001f64f What were yours?"
example
#> [1] "My priorities Saftey First\U0001f1fa\U0001f1f8\U0001f64f What were yours?"
# split the string into single characters and flag the non-ASCII ones
splitwords <- strsplit(example, split = "")[[1]]
nonascii <- grepl("[^\001-\177]", splitwords)
# prepend a space to each flagged character, then collapse back into one string
d1 <- map2_chr(splitwords, nonascii, function(x, y) if_else(y, paste0(" ", x), x))
d1 %>% paste(collapse = "")
#> [1] "My priorities Saftey First \U0001f1fa \U0001f1f8 \U0001f64f What were yours?"