Apply t.test to all rows with the same name?

cwright1 · January 18, 2021, 10:07pm

I want to run a t-test on values grouped by matching names in a specific column. Here is an example:
12 boys and 12 girls go and pick 4 types of flowers. The (arbitrary) size of the flower is listed in the 'boy' or 'girl' column.

mydf <- data.frame(boy=1:12, girl=13:24)
mydf$flower[1:3] <- c("lilly")
mydf$flower[4:6] <- c("rose")
mydf$flower[7:9] <- c("petunia")
mydf$flower[10:12] <- c("violet")

mydf
   boy girl  flower
1    1   13   lilly
2    2   14   lilly
3    3   15   lilly
4    4   16    rose
5    5   17    rose
6    6   18    rose
7    7   19 petunia
8    8   20 petunia
9    9   21 petunia
10  10   22  violet
11  11   23  violet
12  12   24  violet

I want to perform a t test by the flower type. I guess the hypothesis in my example would be something like, "boys pick different sized flowers than girls".

I imagine if I wanted to perform the t test with just rose, I could do:

rose_boy <- c(4,5,6)
rose_girl <- c(16,17,18)
t.test(rose_boy, rose_girl)

But how can I do this for all the flowers of the same type?

FJCC · January 18, 2021, 10:54pm

Here is one method using a custom function and the map function from purrr.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
mydf <- data.frame(boy=1:12, girl=13:24)
mydf$flower[1:3] <- c("lilly")
mydf$flower[4:6] <- c("rose")
mydf$flower[7:9] <- c("petunia")
mydf$flower[10:12] <- c("violet")

Flowers <- unique(mydf$flower)

MyFunc <- function(Nm, DF){
  tmp <-  DF %>% filter(flower == Nm)
  t.test(tmp$boy,tmp$girl)
}

TESTS <- map(Flowers, MyFunc, DF = mydf) #map sends each value in Flowers to MyFunc
# TESTS is a list of 4 elements, a t.test for each flower
names(TESTS) <- Flowers #set the names of TESTS
TESTS$rose
#> 
#>  Welch Two Sample t-test
#> 
#> data:  tmp$boy and tmp$girl
#> t = -14.697, df = 4, p-value = 0.0001247
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -14.266958  -9.733042
#> sample estimates:
#> mean of x mean of y 
#>         5        17

#Compare
rose_boy <- c(4,5,6)
rose_girl <- c(16,17,18)
t.test(rose_boy, rose_girl)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  rose_boy and rose_girl
#> t = -14.697, df = 4, p-value = 0.0001247
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -14.266958  -9.733042
#> sample estimates:
#> mean of x mean of y 
#>         5        17

Created on 2021-01-18 by the reprex package (v0.3.0)
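
For comparison, the same per-flower tests can be sketched in base R with split() and lapply(), without dplyr or purrr. This is only an alternative sketch, not a replacement for the method above; the names by_flower and tests are made up for illustration, and the data frame is built with rep() rather than the step-by-step assignments.

```r
# Same example data as in the thread, built in one call
mydf <- data.frame(
  boy = 1:12,
  girl = 13:24,
  flower = rep(c("lilly", "rose", "petunia", "violet"), each = 3)
)

# split() breaks mydf into a named list of data frames, one per flower
by_flower <- split(mydf, mydf$flower)

# Run the Welch t-test on each piece; lapply() keeps the list names
tests <- lapply(by_flower, function(d) t.test(d$boy, d$girl))

# Individual results are available by flower name
tests$rose$p.value
```

Because split() names the list by flower, there is no separate names() step, and tests$rose should agree with the rose result printed above.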

cwright1 · January 18, 2021, 11:03pm

@FJCC This is exactly what I was trying to accomplish - thank you!

I see from your solution that the p-value is stored in each list element. Is there a way to extract it for all flowers and add it back to the original data frame? I realize this would repeat values across rows, but that is completely OK.

For example, a new column pvals in the original mydf would have the value of 0.0001247 for all "rose" rows.

Again - thanks a lot!

FJCC · January 18, 2021, 11:51pm

Here are methods for returning just the p value from the t test or returning more complete information using the broom package. Note that the second output has the p value twice (p_value and p.value) only because I reused mydf after joining the p values in the first method. Also, all of the p values are the same because every group has the same difference in means (-12) and the same spread within each group.
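
As a quick check on why the p values coincide, here is a minimal sketch that computes the Welch t statistic by hand for two of the groups. The helper welch_t is a made-up name for illustration, not part of any package.

```r
# Boy and girl sizes for two of the flower groups in the example
rose_boy <- c(4, 5, 6)
rose_girl <- c(16, 17, 18)
violet_boy <- c(10, 11, 12)
violet_girl <- c(22, 23, 24)

# Welch t statistic: difference in means divided by its standard error
welch_t <- function(x, y) {
  (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
}

welch_t(rose_boy, rose_girl)
welch_t(violet_boy, violet_girl)
```

Both calls give about -14.697: the mean difference is -12 in each group and each sample has variance 1, so the statistic, and therefore the p value, cannot differ between flowers.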

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
library(broom)
mydf <- data.frame(boy=1:12, girl=13:24)
mydf$flower[1:3] <- c("lilly")
mydf$flower[4:6] <- c("rose")
mydf$flower[7:9] <- c("petunia")
mydf$flower[10:12] <- c("violet")

Flowers <- unique(mydf$flower)

# Just the p values
MyFunc <- function(Nm, DF){
  tmp <-  DF %>% filter(flower == Nm)
  tResult <- t.test(tmp$boy,tmp$girl)
  data.frame(flower = Nm, p_value = tResult$p.value)#return a data frame with the flower and the p value
}

TESTS <- map(Flowers, MyFunc, DF = mydf) #map sends each value in Flowers to MyFunc

names(TESTS) <- Flowers #set the names of TESTS
t_DF <- bind_rows(TESTS)

mydf <- inner_join(mydf, t_DF, by = "flower")
mydf
#>    boy girl  flower     p_value
#> 1    1   13   lilly 0.000124726
#> 2    2   14   lilly 0.000124726
#> 3    3   15   lilly 0.000124726
#> 4    4   16    rose 0.000124726
#> 5    5   17    rose 0.000124726
#> 6    6   18    rose 0.000124726
#> 7    7   19 petunia 0.000124726
#> 8    8   20 petunia 0.000124726
#> 9    9   21 petunia 0.000124726
#> 10  10   22  violet 0.000124726
#> 11  11   23  violet 0.000124726
#> 12  12   24  violet 0.000124726

#include more info
MyFunc2 <- function(Nm, DF){
  tmp <-  DF %>% filter(flower == Nm)
  tResult <- t.test(tmp$boy,tmp$girl)
  DFOut <- broom::glance(tResult)
  DFOut$flower <- Nm
  DFOut
}

TESTS2 <- map(Flowers, MyFunc2, DF = mydf)
t_DF2 <- bind_rows(TESTS2)

mydf2 <- inner_join(mydf, t_DF2, by = "flower")
mydf2
#>    boy girl  flower     p_value estimate estimate1 estimate2 statistic
#> 1    1   13   lilly 0.000124726      -12         2        14 -14.69694
#> 2    2   14   lilly 0.000124726      -12         2        14 -14.69694
#> 3    3   15   lilly 0.000124726      -12         2        14 -14.69694
#> 4    4   16    rose 0.000124726      -12         5        17 -14.69694
#> 5    5   17    rose 0.000124726      -12         5        17 -14.69694
#> 6    6   18    rose 0.000124726      -12         5        17 -14.69694
#> 7    7   19 petunia 0.000124726      -12         8        20 -14.69694
#> 8    8   20 petunia 0.000124726      -12         8        20 -14.69694
#> 9    9   21 petunia 0.000124726      -12         8        20 -14.69694
#> 10  10   22  violet 0.000124726      -12        11        23 -14.69694
#> 11  11   23  violet 0.000124726      -12        11        23 -14.69694
#> 12  12   24  violet 0.000124726      -12        11        23 -14.69694
#>        p.value parameter  conf.low conf.high                  method
#> 1  0.000124726         4 -14.26696 -9.733042 Welch Two Sample t-test
#> 2  0.000124726         4 -14.26696 -9.733042 Welch Two Sample t-test
#> 3  0.000124726         4 -14.26696 -9.733042 Welch Two Sample t-test
#> 4  0.000124726         4 -14.26696 -9.733042 Welch Two Sample t-test
#> 5  0.000124726         4 -14.26696 -9.733042 Welch Two Sample t-test
#> 6  0.000124726         4 -14.26696 -9.733042 Welch Two Sample t-test
#> 7  0.000124726         4 -14.26696 -9.733042 Welch Two Sample t-test
#> 8  0.000124726         4 -14.26696 -9.733042 Welch Two Sample t-test
#> 9  0.000124726         4 -14.26696 -9.733042 Welch Two Sample t-test
#> 10 0.000124726         4 -14.26696 -9.733042 Welch Two Sample t-test
#> 11 0.000124726         4 -14.26696 -9.733042 Welch Two Sample t-test
#> 12 0.000124726         4 -14.26696 -9.733042 Welch Two Sample t-test
#>    alternative
#> 1    two.sided
#> 2    two.sided
#> 3    two.sided
#> 4    two.sided
#> 5    two.sided
#> 6    two.sided
#> 7    two.sided
#> 8    two.sided
#> 9    two.sided
#> 10   two.sided
#> 11   two.sided
#> 12   two.sided

Created on 2021-01-18 by the reprex package (v0.3.0)
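
If only the p value column is needed, a grouped mutate() is a shorter route that skips the helper function and the join. This is a sketch assuming dplyr is installed; mydf_p is a made-up name, and the data frame is rebuilt here so the snippet runs on its own.

```r
library(dplyr)

# Same example data as in the thread, built in one call
mydf <- data.frame(
  boy = 1:12,
  girl = 13:24,
  flower = rep(c("lilly", "rose", "petunia", "violet"), each = 3)
)

# Within each flower group, boy and girl refer to that group's rows only,
# so t.test() runs once per flower and its single p value is recycled
# across the rows of the group
mydf_p <- mydf %>%
  group_by(flower) %>%
  mutate(p_value = t.test(boy, girl)$p.value) %>%
  ungroup()

mydf_p
```

The result is a tibble with the original 12 rows in their original order plus a p_value column. The functions above remain the better fit when the full test objects or the broom summaries are wanted.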
