I modified code from Dr. Silge's *Text Mining with R* book, but chose to use weighted log odds instead of term frequency-inverse document frequency (tf-idf), based on this blog post (3.2 Weighted log odds ratio | Notes for "Text Mining with R: A Tidy Approach"). My code is below, but it produces two values for the same term. If I want one value per term, with each term assigned to the group it is most likely to belong to, how would I accomplish that?
library(dplyr)
library(tidytext) # for stop_words
library(tidylo)
words_by_weapon <- text_clean_words %>%
add_count(weapon, name = "total_words") %>% # total word count per weapon group
group_by(weapon, total_words) %>%
count(word, sort = TRUE) %>% # count each word within each group
ungroup() %>%
filter(!word %in% stop_words$word) # drop common stop words
wlo_all <- words_by_weapon %>%
bind_log_odds(set = weapon, feature = word, n = n)
ethulin:
text_clean_words
Hi, where does text_clean_words come from? If it is your own data, can you provide an example of it? Thanks.
A minimal reproducible example consists of the following items:
A minimal dataset, necessary to reproduce the issue
The minimal runnable code necessary to reproduce the issue, which can be run
on the given dataset, and including the necessary information on the used packages.
Let's quickly go over each one of these with examples:
Minimal Dataset (Sample Data)
You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue.
Let's say, as an example, that you are working with the iris data frame:
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.…
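A convenient way to produce such a pasteable dataset is base R's dput(), which prints an R expression that recreates the object exactly (a small sketch):

```r
# dput() prints a copy-pasteable expression that rebuilds the object
dput(head(iris, 3))
```

Pasting that output into your post lets others reconstruct your data with a single assignment.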
The link to the csv is (PRACTICE/exampledataset.csv at master · elysethulin/PRACTICE · GitHub), and here is the full code, which can be run on a local device:
library(dplyr)
library(tidyr)
library(tidytext) # for unnest_tokens() and stop_words
library(tidylo)
leaves_clean <- read.csv("localfilepath/exampledataset.csv")
text_clean_words <- leaves_clean %>%
unnest_tokens(word, text) %>%
filter(!word %in% stop_words$word) %>%
count(weapon, word) %>%
ungroup()
wlo_all <- text_clean_words %>%
bind_log_odds(set = weapon, feature = word, n = n)
View(wlo_all)
## Then look at specific words, such as 'weed', and you should see two values: 0.9287126 for weapon==FALSE and -0.7734112 for weapon==FALSE.
## My understanding of weighted log odds is that there should be a single value per word (as opposed to two rows for one word, one per weapon status).
It looks like it is one value per set/feature combination. For example:
## example ---------------------
library(dplyr)
library(tidytext) # for unnest_tokens()
library(tidylo)
library(janeaustenr)
book_words <- austen_books() %>%
unnest_tokens(word, text) %>%
add_count(book, name = "total_words") %>%
group_by(book, total_words) %>%
count(word, sort = TRUE) %>%
ungroup()
book_words %>%
bind_log_odds(set = book, feature = word, n = n) %>%
arrange(desc(log_odds_weighted)) %>%
filter(word == "almost")
# A tibble: 6 × 5
book total_words word n log_odds_weighted
<fct> <int> <chr> <int> <dbl>
1 Northanger Abbey 77780 almost 60 0.627
2 Mansfield Park 160460 almost 124 0.479
3 Persuasion 83658 almost 60 0.384
4 Sense & Sensibility 119957 almost 85 0.257
5 Pride & Prejudice 122204 almost 59 -0.850
6 Emma 160996 almost 88 -0.857
So there is one value per weapon group for your data:
library(tidyverse)
library(tidylo)
library(tidytext)
leaves_clean <- read.csv("https://raw.githubusercontent.com/elysethulin/PRACTICE/master/exampledataset.csv")
text_clean_words <- leaves_clean %>%
mutate(weapon = if_else(weapon == "FLASE", "FALSE", weapon)) %>% # fix the "FLASE" typo so FALSE is a single group
unnest_tokens(word, text) %>%
filter(!word %in% stop_words$word) %>%
count(weapon, word) %>%
ungroup()
wlo_all <- text_clean_words %>%
bind_log_odds(set = weapon, feature = word, n = n)
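To get the single value per term that the original question asks for, with each term assigned to the group where it is most distinctive, one option is to keep, for each word, the row with the highest weighted log odds. A sketch using dplyr::slice_max on the wlo_all tibble above (log_odds_weighted is the column bind_log_odds adds by default):

```r
library(dplyr)

wlo_top <- wlo_all %>%
  group_by(word) %>%
  # keep only the weapon group where this word's weighted log odds is highest
  slice_max(log_odds_weighted, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Each word then appears exactly once, labeled with the weapon group it is most characteristic of.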
system closed this topic on September 7, 2022, 4:58am.
This topic was automatically closed 21 days after the last reply. New replies are no longer allowed. If you have a query related to it or one of the replies, start a new topic and refer back with a link.