Weighted Log Odds Between Groups

I modified code from Dr. Silge's Text Mining book, but chose to use weighted log odds instead of term frequency-inverse document frequency (tf-idf) because of this blog post (3.2 Weighted log odds ratio | Notes for “Text Mining with R: A Tidy Approach”). My code is below, but it produces two values for the same term. If I want a single value per term, with the term assigned to the group it is most likely to belong to, how would I accomplish that?

library(dplyr)
library(tidytext)   # for stop_words
library(tidylo)

words_by_weapon <- text_clean_words %>%
  add_count(weapon, name = "total_words") %>%
  group_by(weapon, total_words) %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  filter(!word %in% stop_words$word)

wlo_all <- words_by_weapon %>%
  bind_log_odds(set = weapon, feature = word, n = n)

Hi, where does text_clean_words come from? If it is your own data, can you provide an example of it? Thanks.

The link to the csv is (PRACTICE/exampledataset.csv at master · elysethulin/PRACTICE · GitHub), and here is the full code, which can be run on a local machine:

library(dplyr)
library(tidyr)
library(tidytext)   # for unnest_tokens() and stop_words
library(tidylo)

leaves_clean <- read.csv("localfilepath/exampledataset.csv")

text_clean_words <- leaves_clean %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(weapon, word) %>%
  ungroup()

wlo_all <- text_clean_words %>%
  bind_log_odds(set = weapon, feature = word, n = n)

View(wlo_all)

## Then look at specific words, such as 'weed', and you should see two values: 0.9287126 for weapon == FALSE and -0.7734112 for weapon == TRUE.

## My understanding of weighted log odds is that I should get a single value per word (as opposed to two rows per word, one for each weapon status).

It looks like bind_log_odds() returns one value per combination of set and feature. For example:



## example ---------------------
library(dplyr)
library(tidytext)
library(tidylo)
library(janeaustenr)

book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  add_count(book, name = "total_words") %>%
  group_by(book, total_words) %>%
  count(word, sort = TRUE) %>%
  ungroup()

book_words %>%
  bind_log_odds(set = book, feature = word, n = n) %>%
  arrange(desc(log_odds_weighted)) %>% 
  filter(word == "almost")


# A tibble: 6 × 5
  book                total_words word       n log_odds_weighted
  <fct>                     <int> <chr>  <int>             <dbl>
1 Northanger Abbey          77780 almost    60             0.627
2 Mansfield Park           160460 almost   124             0.479
3 Persuasion                83658 almost    60             0.384
4 Sense & Sensibility      119957 almost    85             0.257
5 Pride & Prejudice        122204 almost    59            -0.850
6 Emma                     160996 almost    88            -0.857

So for your data you get one value per word for each weapon group:

library(tidyverse)
library(tidylo)
library(tidytext)

leaves_clean <- read.csv("https://raw.githubusercontent.com/elysethulin/PRACTICE/master/exampledataset.csv")

text_clean_words <- leaves_clean %>%
  mutate(weapon = if_else(weapon == "FLASE", "FALSE", weapon)) %>% # fix a typo in the data: "FLASE" should be "FALSE"
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(weapon, word) %>%
  ungroup()

wlo_all <- text_clean_words %>%
  bind_log_odds(set = weapon, feature = word, n = n) 
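
To get what the original question asks for (one value per term, with the term assigned to the group it is most associated with), one option is to keep, for each word, only the row with the highest weighted log odds. Here is a minimal sketch, assuming the wlo_all data frame from the code above; wlo_top_group is just an illustrative name, and it uses dplyr's slice_max():

wlo_top_group <- wlo_all %>%
  group_by(word) %>%
  slice_max(log_odds_weighted, n = 1, with_ties = FALSE) %>% # keep the group where each word is most over-represented
  ungroup()

# each word now appears once, paired with the weapon group it is most strongly associated with
wlo_top_group %>%
  arrange(desc(log_odds_weighted)) %>%
  head()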
