Replace words in a tweet with numeric value of their frequency

Pingi · July 24, 2021, 2:30pm

Hello community!

I am having trouble replacing the words in a tweet with the numeric value of their frequency.

I have already made a data frame showing the words ranked by their frequency.

Now I want to substitute the words in the tweets with the frequency rank of every word.

I attached snips of my data frames.

Tweets and word frequency data:

My goal is for a tweet to look like this:

[1] [3] [7] [11] [18] [12] [10] [5] [3] [44] [23] [46] [2] [90]

The [1] means that it is the most frequent word in the dataset.

Any help appreciated!

HanOostdijk · July 24, 2021, 6:14pm

Maybe this helps

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# generate frequency table; `nr` is the rank (1 = most frequent word)
freq <- data.frame(
  stringsAsFactors = FALSE,
  words = c("die", "rt", "der"),
  n = c(12870, 10190, 6598)
) %>%
  arrange(desc(n)) %>%
  mutate(nr = row_number())

# generate tweet table from character tweet (one row per word)
tweet <- "Bitte rt der Tweet"
tweettxt <- data.frame(
  stringsAsFactors = FALSE,
  tweetwords = strsplit(tweet, " ")[[1]]
)

# combine the two tables: column `n` will contain the frequencies, `nr` the ranks
# words that are not in `freq` get frequency 0 and rank Inf
tweetnum <- tweettxt %>%
  left_join(freq, by = c("tweetwords" = "words")) %>%
  mutate(n = ifelse(is.na(n), 0, n),
         nr = ifelse(is.na(nr), Inf, nr))

tweetnum
#>   tweetwords     n  nr
#> 1      Bitte     0 Inf
#> 2         rt 10190   2
#> 3        der  6598   3
#> 4      Tweet     0 Inf
Created on 2021-07-24 by the reprex package (v2.0.0)

Pingi · July 26, 2021, 9:38am

Hey HanOostdijk,

thanks for your help, but unfortunately that is not quite the solution I need.

My goal is to replace every word in the Twitter dataset (Text) with its frequency rank across the whole Twitter dataset (15000 tweets).

I have already ranked all words by their frequency. The whole dataset has 18420 different words, so every word in the text has a rank based on its frequency.

Now I want to replace the words in the tweets with that frequency rank, so every word becomes a number between 1 and 18420.

For example, the first tweet should look like this:

Original tweet:
Apropos #baerbockfails Von #CDU und #CSU ist die sogenannte "bürgerliche Mitte" Betrug und Trickserei gewöhnt.

Frequency rank of the words:
Apropos: 2890
baerbockfails: 2629
Von: 14
CDU: 8
und: 6
CSU: 48
ist: 13
die: 1
sogenannte: 1282
bürgerliche: 2460
Mitte: 972
Betrug: 1733
und: 6
Trickserei: 16698
gewöhnt: 11959

The transformed tweet should then look like this:
Words replaced by word frequency rank:
[2890] [2629] [14] [8] [6] [48] [13] [1] [1282] [2460] [972] [1733] [6] [16698] [11959]

I hope this clarifies my point. I am looking forward to any help!
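
The lookup described in this post can be sketched in base R as follows. This is only an illustration under assumptions: the named vector `rank_lookup` is a hypothetical stand-in for the full rank table and holds just the ranks listed above, and the cleaning step (dropping `#`, double quotes and full stops before splitting on spaces) is a guess at how the words were tokenized.

```r
# Ranks copied from the example above; in practice these would come from the
# full frequency table (the name rank_lookup is hypothetical).
rank_lookup <- c(
  Apropos = 2890, baerbockfails = 2629, Von = 14, CDU = 8, und = 6,
  CSU = 48, ist = 13, die = 1, sogenannte = 1282, "bürgerliche" = 2460,
  Mitte = 972, Betrug = 1733, Trickserei = 16698, "gewöhnt" = 11959
)

tweet <- "Apropos #baerbockfails Von #CDU und #CSU ist die sogenannte \"bürgerliche Mitte\" Betrug und Trickserei gewöhnt."

# Assumption: '#', double quotes and '.' are not part of a word
cleaned <- gsub("[#\".]", "", tweet)
words <- strsplit(cleaned, " +")[[1]]

# Look up every word by name; a word missing from rank_lookup would give NA
ranks <- unname(rank_lookup[words])

tweet_ranks <- paste0("[", ranks, "]", collapse = " ")
tweet_ranks
#> [1] "[2890] [2629] [14] [8] [6] [48] [13] [1] [1282] [2460] [972] [1733] [6] [16698] [11959]"
```

A named vector is used here because indexing it with an unknown word returns `NA` rather than raising an error.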

HanOostdijk · July 26, 2021, 9:58am

Hello @Pingi ,
I think I understand your problem.
Because you did not provide a reprex (always much appreciated), I created one in which my small freq data.frame takes the place of your 18420-word dataset.
If you want to change the format to the one with [ and ], you can use

tweetchar <- paste("[", tweetnum$nr, "]", sep = "", collapse = " ")
tweetchar
#> [1] "[Inf] [2] [3] [Inf]"
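
The lookup and the formatting step can also be folded into one helper. The following is a sketch only: it rebuilds the small `freq` table from the reprex in base R, uses `match()` in place of `left_join()`, and the function name `tweet_to_ranks` and its `unknown` argument are made up for illustration.

```r
# Small frequency table as in the reprex (three words only)
freq <- data.frame(
  words = c("die", "rt", "der"),
  n = c(12870, 10190, 6598),
  stringsAsFactors = FALSE
)
freq <- freq[order(-freq$n), ]   # most frequent word first
freq$nr <- seq_len(nrow(freq))   # rank 1 = most frequent

# Hypothetical helper: split a tweet on spaces, look up the rank of each word
# and return the bracketed string. Words missing from freq get `unknown`.
tweet_to_ranks <- function(tweet, freq, unknown = Inf) {
  words <- strsplit(tweet, " ", fixed = TRUE)[[1]]
  nr <- freq$nr[match(words, freq$words)]  # NA where the word is not in freq
  nr[is.na(nr)] <- unknown
  paste0("[", nr, "]", collapse = " ")
}

tweet_to_ranks("Bitte rt der Tweet", freq)
#> [1] "[Inf] [2] [3] [Inf]"
```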

gitdemont · July 26, 2021, 10:14am

Hi,

# we use OpenRepGrid to generate random words
random_words = OpenRepGrid::randomWords(1000)

# then we get a dictionary with the frequency of each word
freq = table(random_words)

# text with some words from random_words
tweet = "what happened to Mr. Johnson"

# the text is split on whitespace
tweet_words = strsplit(tweet, "\\s")[[1]]

# keep only the words that are in the dictionary: indexing a table by an
# unknown name fails with "subscript out of bounds" instead of returning NA
known_words = tweet_words[tweet_words %in% names(freq)]

# we replace each remaining word in the text by its frequency
paste0("[", freq[known_words], "]", collapse = "")
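
Note that `table()` gives raw counts, while the question asks for ranks. A possible way to turn the counts into ranks is sketched below; the toy vector `corpus_words` and the name `word_rank` are made up for illustration, and tied words simply get consecutive ranks here.

```r
# Toy word list standing in for the real corpus (made up for illustration)
corpus_words <- c("die", "rt", "die", "der", "die", "rt")

counts <- table(corpus_words)   # die = 3, rt = 2, der = 1

# Sort by descending count, then number the words 1, 2, 3, ...
sorted <- sort(counts, decreasing = TRUE)
word_rank <- setNames(seq_along(sorted), names(sorted))

# A plain named vector returns NA for unknown words instead of an error
unname(word_rank[c("rt", "der", "Tweet")])
#> [1]  2  3 NA
```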

Pingi · July 26, 2021, 11:53am

@HanOostdijk ,

thank you so much, that worked!

A last step would be to do this for every tweet in the dataset and store the values for each tweet in a dataset with the same format as the original Twitter dataset.
So the values for the first tweet are in the first row, the values for the second tweet in the second row, and so on...

My goal is for it to look like this:
(screenshot: Bildschirmfoto 2021-07-26 um 13.51.36)

Is there a way to do this?
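
One possible approach is to apply the same lookup to each row of the tweet data frame. The sketch below rests on assumptions, because the original data frames were only posted as screenshots: the tweets are taken to sit in a character column called `text`, the ranks in a named vector `word_rank`, and the helper `to_rank_string` is a made-up name.

```r
# Toy data standing in for the real 15000-tweet data frame (assumed layout)
tweets <- data.frame(
  text = c("Bitte rt der Tweet", "die der rt"),
  stringsAsFactors = FALSE
)

# Named vector: word -> frequency rank (values from the small reprex)
word_rank <- c(die = 1, rt = 2, der = 3)

# Convert one tweet to its bracketed rank string; unknown words become [NA]
to_rank_string <- function(tweet, word_rank) {
  words <- strsplit(tweet, " +")[[1]]
  paste0("[", unname(word_rank[words]), "]", collapse = " ")
}

# One result per row, stored next to the original text
tweets$ranks <- vapply(tweets$text, to_rank_string, character(1),
                       word_rank = word_rank, USE.NAMES = FALSE)

tweets$ranks
#> [1] "[NA] [2] [3] [NA]" "[1] [3] [2]"
```

Because `vapply()` returns exactly one string per tweet, the new column keeps the row order of the original data frame.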
