I modified code from Dr. Silge's *Text Mining with R* book, but chose to use weighted log odds instead of term frequency-inverse document frequency (tf-idf), based on this blog post (3.2 Weighted log odds ratio | Notes for "Text Mining with R: A Tidy Approach"). My code is below, but it produces two values for the same term. If I want one value per term, with each term assigned to the group it is most likely to belong to, how would I accomplish that?
library(dplyr)
library(tidytext) # for stop_words
library(tidylo)
words_by_weapon <- text_clean_words %>%
add_count(weapon, name = "total_words") %>% # total word count per weapon group
group_by(weapon, total_words) %>%
count(word, sort = TRUE) %>% # count each word within each group
ungroup() %>%
filter(!word %in% stop_words$word) # drop common stop words
wlo_all <- words_by_weapon %>%
bind_log_odds(set = weapon, feature = word, n = n)
ethulin:
text_clean_words
Hi, where does text_clean_words come from? If it is your own data, can you provide an example of it? Thanks.
A minimal reproducible example consists of the following items:
A minimal dataset, necessary to reproduce the issue
The minimal runnable code necessary to reproduce the issue, which can be run
on the given dataset, and including the necessary information on the used packages.
Let's quickly go over each one of these with examples:
Minimal Dataset (Sample Data)
You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue.
Let's say, as an example, that you are working with the iris data frame:
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.…
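A convenient way to produce such a pasteable dataset is base R's dput(), which prints an R expression that recreates the object exactly (a small sketch):

```r
# dput() prints a copy-pasteable expression that rebuilds the object
dput(head(iris, 3))
```

Pasting that output into your post lets others reconstruct your data with a single assignment.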
The link to the csv is (PRACTICE/exampledataset.csv at master · elysethulin/PRACTICE · GitHub), and here is the full code, which can be run on a local device:
library(dplyr)
library(tidyr)
library(tidytext) # for unnest_tokens() and stop_words
library(tidylo)
leaves_clean <- read.csv("localfilepath/exampledataset.csv")
text_clean_words <- leaves_clean %>%
unnest_tokens(word, text) %>%
filter(!word %in% stop_words$word) %>%
count(weapon, word) %>%
ungroup()
wlo_all <- text_clean_words %>%
bind_log_odds(set = weapon, feature = word, n = n)
View(wlo_all)
## Then look at specific words, such as 'weed', and you should see two values: 0.9287126 for weapon==FALSE and -0.7734112 for weapon==FALSE.
## My understanding of weighted log odds is that there should be a single value per word (as opposed to two rows for one word, one per weapon status).
It looks like it is one value per set/feature combination. For example:
## example ---------------------
library(dplyr)
library(tidytext) # for unnest_tokens()
library(tidylo)
library(janeaustenr)
book_words <- austen_books() %>%
unnest_tokens(word, text) %>%
add_count(book, name = "total_words") %>%
group_by(book, total_words) %>%
count(word, sort = TRUE) %>%
ungroup()
book_words %>%
bind_log_odds(set = book, feature = word, n = n) %>%
arrange(desc(log_odds_weighted)) %>%
filter(word == "almost")
# A tibble: 6 × 5
book total_words word n log_odds_weighted
<fct> <int> <chr> <int> <dbl>
1 Northanger Abbey 77780 almost 60 0.627
2 Mansfield Park 160460 almost 124 0.479
3 Persuasion 83658 almost 60 0.384
4 Sense & Sensibility 119957 almost 85 0.257
5 Pride & Prejudice 122204 almost 59 -0.850
6 Emma 160996 almost 88 -0.857
So there is one value per weapon group for your data:
library(tidyverse)
library(tidylo)
library(tidytext)
leaves_clean <- read.csv("https://raw.githubusercontent.com/elysethulin/PRACTICE/master/exampledataset.csv")
text_clean_words <- leaves_clean %>%
mutate(weapon = if_else(weapon == "FLASE", "FALSE", weapon)) %>% # fix the "FLASE" typo so FALSE is a single group
unnest_tokens(word, text) %>%
filter(!word %in% stop_words$word) %>%
count(weapon, word) %>%
ungroup()
wlo_all <- text_clean_words %>%
bind_log_odds(set = weapon, feature = word, n = n)
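To get the single value per term that the original question asks for, with each term assigned to the group where it is most distinctive, one option is to keep, for each word, the row with the highest weighted log odds. A sketch using dplyr::slice_max on the wlo_all tibble above (log_odds_weighted is the column bind_log_odds adds by default):

```r
library(dplyr)

wlo_top <- wlo_all %>%
  group_by(word) %>%
  # keep only the weapon group where this word's weighted log odds is highest
  slice_max(log_odds_weighted, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Each word then appears exactly once, labeled with the weapon group it is most characteristic of.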
system closed this topic on September 7, 2022, 4:58am.
This topic was automatically closed 21 days after the last reply. New replies are no longer allowed. If you have a query related to it or one of the replies, start a new topic and refer back with a link.