get_p_value directions don't add up

I would expect get_p_value(direction = "both") to equal the sum of the two tails, i.e., get_p_value() with direction = "greater" at the observed value plus get_p_value() with direction = "less" at the negative of the observed value. But it turns out this doesn't quite work.
Here's my code below. You can see that the p-value for the right tail is 0.284 whereas the p-value for the left tail is 0.294. But with direction = "both" we get a p-value of 0.568, which is twice the right tail rather than the sum of both tails. What gives?

library(tidyverse)  # tribble(), mutate(), and the pipe
library(infer)      # specify(), hypothesize(), generate(), calculate(), get_p_value()

set.seed(123)
df <- tribble(~group, ~value, 
              0,   1, 
              0,   1, 
              0,   1, 
              0,   1, 
              0,   0,  
              0,   0,  
              1,   0, 
              1,   0, 
              1,   0, 
              1,   0,
              1,   1, 
              1,   1) |> 
  mutate(group = as.factor(group))

observed_diff <- df %>%
  specify(value ~ group) %>%
  calculate("diff in means", order = c("0", "1")) |> 
  pull(stat)

diff.under.null <- df |> 
  specify(value ~ group) |>
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate("diff in means", order = c("0", "1")) 

diff.under.null |> 
  get_p_value(
    obs_stat = observed_diff, 
    direction = "greater") |> pull(p_value)
#> [1] 0.284

diff.under.null |> 
  get_p_value(
    obs_stat = - observed_diff, 
    direction = "less") |> pull(p_value)
#> [1] 0.294


diff.under.null |> 
  get_p_value(
    obs_stat = observed_diff, 
    direction = "both") |> pull(p_value)
#> [1] 0.568
# 0.568 = 2 * 0.284, not 0.284 + 0.294 = 0.578

A simulation-based method is being used to calculate the p-values. As a result, some simulation error is to be expected (although I am surprised it is as high as you are finding). Change reps to 100,000 and see if the answers are closer. (Probably going to need to go get a coffee while it's running. :grinning:)
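
For example, a minimal sketch of that suggestion, just changing reps in the pipeline you already have (everything else unchanged):

# same pipeline as above, but with 100x more permutation replicates
diff.under.null.big <- df |>
  specify(value ~ group) |>
  hypothesize(null = "independence") |>
  generate(reps = 100000, type = "permute") |>
  calculate("diff in means", order = c("0", "1"))

diff.under.null.big |>
  get_p_value(obs_stat = observed_diff, direction = "both") |>
  pull(p_value)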

Thanks @startz, but my issue is not with the size of the p-values per se, but with the relationship between the p-value associated with the right tail (direction = "greater"), the left tail (direction = "less"), and both tails (direction = "both").

You would expect the both-tails p-value to be the sum of the right tail and the left tail, but it turns out to be simply twice the right tail. Isn't that strange?

There is little point to using R to re-invent statistics. The depth and breadth of the statistical procedures provided through {base} and {stats}, plus the wealth of packages, is the whole point of using R. Although it can be tortured into doing anything at all, R is a crappy procedural language, and there are better tools with which to reinvent wheels. So, here's how to run t-tests on this dataset, if doing so made any sense to begin with (more on that later):

d <- tibble::tribble(~group, ~value, 
              0,   1, 
              0,   1, 
              0,   1, 
              0,   1, 
              0,   0,  
              0,   0,  
              1,   0, 
              1,   0, 
              1,   0, 
              1,   0,
              1,   1, 
              1,   1)
m <- as.matrix(d)

p_two  <- t.test(group ~ value, m, alternative = "two.sided")$p.value
p_less <- t.test(group ~ value, m, alternative = "less")$p.value
p_more <- t.test(group ~ value, m, alternative = "greater")$p.value
p_two; p_less; p_more
#> [1] 0.2896916
#> [1] 0.8551542
#> [1] 0.1448458

identical(p_less + p_more, 1)
#> [1] TRUE
identical(p_two, p_more * 2)
#> [1] TRUE

Created on 2023-06-30 with reprex v2.0.2

To see why the three p-values don't simply add up, consider the relationship among the three "alternative" arguments: the two one-sided p-values are complements of one another (they sum to 1), while the two-sided p-value is twice the smaller of them. The law of the excluded middle has no scope here.
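
To make those relationships concrete, here is a small sketch recomputing the three p-values directly from the t distribution with pt(); tstat and dfree are just illustrative names for the fitted statistic and degrees of freedom:

tt <- t.test(group ~ value, m)        # two-sided by default
tstat <- unname(tt$statistic)
dfree <- unname(tt$parameter)

# the two one-sided p-values are complements, so they sum to 1
pt(tstat, dfree) + pt(tstat, dfree, lower.tail = FALSE)

# and because the t distribution is symmetric, the two-sided
# p-value is twice the smaller tail
2 * min(pt(tstat, dfree), pt(tstat, dfree, lower.tail = FALSE))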

Even though it illustrates the tail question, the t.test shows nothing about the data because, at a minimum, the data are not continuous. These are not data that could have taken on any value in the range 0:1; they are binary. So t.test is being asked to perform a test it was not designed for.

Instead

(ct <- table(m[,1], m[,2]))
#>    
#>     0 1
#>   0 2 4
#>   1 4 2
chisq.test(ct)
#> Warning in chisq.test(ct): Chi-squared approximation may be incorrect
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  ct
#> X-squared = 0.33333, df = 1, p-value = 0.5637
prop.test(ct)
#> Warning in prop.test(ct): Chi-squared approximation may be incorrect
#> 
#>  2-sample test for equality of proportions with continuity correction
#> 
#> data:  ct
#> X-squared = 0.33333, df = 1, p-value = 0.5637
#> alternative hypothesis: two.sided
#> 95 percent confidence interval:
#>  -1.000000  0.366768
#> sample estimates:
#>    prop 1    prop 2 
#> 0.3333333 0.6666667
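
Given the "approximation may be incorrect" warnings above (the expected cell counts are small), an exact test on the same table is arguably the safer choice:

# Fisher's exact test avoids the large-sample approximation entirely
fisher.test(ct)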

There's even less point to re-implementing standard tests for the purpose of applying them to inappropriate data.

You are right. My answer was wrong. Since all the tests use the same generated values, Monte Carlo error can't explain what's going on.

@technocrat Since the t-test assumes a t-distribution, which is symmetric, by definition the p-value associated with the left tail (at the negative of the observed statistic) equals the p-value associated with the right tail, so doubling one tail is the same as summing both tails. But the simulated null distribution (here, a permutation distribution) is not always symmetric, and when it is not, the infer package does not appear to calculate the p-value as I would expect it to... Is that an error?


set.seed(123)
df <- tribble(~group, ~value, 
              0,   1, 
              0,   1, 
              0,   1, 
              0,   1, 
              0,   0,  
              0,   0,  
              1,   0, 
              1,   0, 
              1,   0, 
              1,   0,
              1,   1, 
              1,   1) |> 
  mutate(group     = factor(group, levels=c("0", "1")), 
         rev.group = factor(group, levels=c("1", "0"))) 

p_both.tails  <- t.test(value ~ group, data = df, 
                 alternative = "two.sided")$p.value

p_right.tail <- t.test(value ~ group, data = df, 
                       alternative = "greater")$p.value

p_left.tail <- t.test(value ~ rev.group, data = df, 
                      alternative = "less")$p.value

waldo::compare(p_both.tails, p_right.tail + p_left.tail)
#> βœ” No differences

NB: In principle, the t-test isn't appropriate here anyway, because its assumptions (a normally distributed outcome variable) are not met.


Interestingly, when switching the order of the groups (order = c("1", "0") instead of the other way around), we find that the p-value associated with both tails is twice the p-value of the left tail (and not the right tail, as in the initial example):


set.seed(123)
df <- tribble(~group, ~value, 
              0,   1, 
              0,   1, 
              0,   1, 
              0,   1, 
              0,   0,  
              0,   0,  
              1,   0, 
              1,   0, 
              1,   0, 
              1,   0,
              1,   1, 
              1,   1)  |> 
  mutate(group = factor(group))


diff.under.null <- df |> 
  specify(value ~ group) |>
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate("diff in means", order = c("1", "0")) 

observed_diff <- df %>%
  specify(value ~ group) %>%
  calculate("diff in means", order = c("1", "0")) |> 
  pull(stat)


p_right.tail.permute <-  diff.under.null |> 
  get_p_value(
    obs_stat = - observed_diff, 
    direction = "greater") |> pull(p_value)
# 0.294

p_left.tail.permute <-  diff.under.null |> 
  get_p_value(
    obs_stat = observed_diff, 
    direction = "less") |> pull(p_value)
# 0.284

p_both.tails.permute <- diff.under.null |> 
  get_p_value(
    obs_stat = observed_diff, 
    direction = "both") |> pull(p_value)
# 0.568

waldo::compare(p_both.tails.permute, 
               p_left.tail.permute + p_right.tail.permute)
#> `old`: 0.57
#> `new`: 0.58

waldo::compare(p_both.tails.permute, 2 * p_left.tail.permute)
#> βœ” No differences

Finally, the code below shows exactly how the p-values are calculated. As I suspected, in this case the tails of the permutation distribution are not equal, not even in the limit, and the two-sided p-value is not really the sum of both tails. Instead, it is simply twice the right tail. This is almost surely an error.
I opened an issue on GitHub, let's see what happens now...

library(tidyverse)
library(infer)

set.seed(123)
df <- tribble(~group, ~value, 
              0,   1, 
              0,   1, 
              0,   1, 
              0,   1, 
              0,   0,  
              0,   0,  
              1,   0, 
              1,   0, 
              1,   0, 
              1,   0,
              1,   1, 
              1,   1) |> 
  mutate(group = factor(group))


observed_diff <- df %>%
  specify(value ~ group) %>%
  calculate("diff in means", order = c("0", "1")) |> 
  pull(stat)



diff.under.null <- df |> 
  specify(value ~ group) |>
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate("diff in means", order = c("0", "1")) 

diff.under.null |> count(stat)
#> Response: value (numeric)
#> Explanatory: group (factor)
#> Null Hypothesis: independence
#> # A tibble: 6 Γ— 2
#>     stat     n
#>    <dbl> <int>
#> 1 -1         3
#> 2 -0.667    38
#> 3 -0.333   253
#> 4  0       422
#> 5  0.333   254
#> 6  0.667    30

p_right.tail.permute <-  diff.under.null |> 
  get_p_value(
    obs_stat = observed_diff, 
    direction = "greater") |> pull(p_value)

waldo::compare(p_right.tail.permute, (254 + 30)/1000)
#> βœ” No differences


p_left.tail.permute <-  diff.under.null |> 
  get_p_value(
    obs_stat = - observed_diff, 
    direction = "less") |> pull(p_value)
waldo::compare(p_left.tail.permute, (3 + 38 + 253)/1000)
#> βœ” No differences

p_both.tails.permute <- diff.under.null |> 
  get_p_value(
    obs_stat = observed_diff, 
    direction = "both") |> pull(p_value)

waldo::compare(p_both.tails.permute, 
               p_left.tail.permute + p_right.tail.permute)
#> `old`: 0.57
#> `new`: 0.58
waldo::compare(p_both.tails.permute, 2 * p_right.tail.permute)
#> βœ” No differences
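
For completeness, the same three numbers can be reproduced by hand from the permutation draws, which makes the doubling explicit:

stats <- diff.under.null$stat

p_greater <- mean(stats >= observed_diff)    # right tail: 0.284
p_less    <- mean(stats <= -observed_diff)   # left tail:  0.294

2 * p_greater        # 0.568, which matches direction = "both"
p_greater + p_less   # 0.578, the sum-of-tails definition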
