t-tests over several variables (using a tibble of lists)

courtiol · March 28, 2019, 10:38am

Dear all,
I am trying to run a t-tests over several measurements as a way to explore the power of tibbles containing lists in columns, which I find elegant.

In base R (not using a complex tibble or alike), I would do something like that:

library(broom)

traits <- colnames(iris)[colnames(iris) != "Species"]
list_ttests <- lapply(traits, function(trait) tidy(t.test(iris[iris$Species == "setosa", trait],
                                                          iris[iris$Species == "versicolor", trait])))
res1 <- cbind(Traits = traits, do.call("rbind", list_ttests))
colnames(res1)[colnames(res1) %in% c("estimate1", "estimate2")] <- c("mean_setosa", "mean_versicolor")
View(res1)

Trying to embrace tidyverse, I could only come up with the following:

library(tidyr) ## devel version tidyr_0.8.3.9000!
library(dplyr)
library(purrr)

iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  pivot_longer(cols = -Species, names_to = "variable", values_to = "value") %>%
  group_by(Species, variable) %>% 
  summarise(values = list(value)) %>%
  pivot_wider(names_from = Species, values_from = values) %>% ##tibble of lists ready
  bind_cols(pmap_df(., ~ tidy(t.test(unlist(..2), unlist(..3))))) %>%
  rename(mean_setosa = estimate1, mean_versicolor = estimate2) -> res2

View(res2)

It is not too horrendous, but it is still complicated for beginners.
I wonder if something more elegant could be done to arrive faster to the tibble of lists (i.e. not pivoting twice), and if in the call to pmap there would be way to use column names (setosa and versicolor) instead of the ugly ..1, ..2.

Any tips welcome! @mishabalyasin?

Thanks

tbradley · March 28, 2019, 12:30pm

You can do it like this. I have not switched to the development version of tidyr but if you replace my spread and gather with your pivot_longer and pivot_wider, respectively, it should work fine.

This approach uses nest via group_nest (which is the same as group_by() %>% nest()) to create list columns of all the different variables for both species. Then I used tidyr::crossing to cross the nested tibble against itself (hence the double periods) to get all of the combinations of variables. Then I filtered out the ones I don't want (you may have to change this if you want to compare variables between species rather than just the variables within species, as it is now doing) and then I use mutate + map to iterate over every row of the dataframe and perform t.tests for all of the crossing data. Check out the code below:

library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.5.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(broom)
library(purrr)

iris %>% 
  filter(Species %in% c("setosa", "versicolor")) %>% 
  gather(variable, value, -Species) %>% 
  group_nest(Species, variable) %>% 
  crossing(., .) %>% 
  filter(Species == Species1, 
         variable != variable1) %>% 
  mutate(t_test = map2(data, data1, ~{t.test(.x$value, .y$value) %>% broom::tidy()})) %>% 
  unnest(t_test, .drop = TRUE)
#> # A tibble: 24 x 14
#>    Species variable Species1 variable1 estimate estimate1 estimate2
#>    <fct>   <chr>    <fct>    <chr>        <dbl>     <dbl>     <dbl>
#>  1 setosa  Petal.L~ setosa   Petal.Wi~     1.22     1.46      0.246
#>  2 setosa  Petal.L~ setosa   Sepal.Le~    -3.54     1.46      5.01 
#>  3 setosa  Petal.L~ setosa   Sepal.Wi~    -1.97     1.46      3.43 
#>  4 setosa  Petal.W~ setosa   Petal.Le~    -1.22     0.246     1.46 
#>  5 setosa  Petal.W~ setosa   Sepal.Le~    -4.76     0.246     5.01 
#>  6 setosa  Petal.W~ setosa   Sepal.Wi~    -3.18     0.246     3.43 
#>  7 setosa  Sepal.L~ setosa   Petal.Le~     3.54     5.01      1.46 
#>  8 setosa  Sepal.L~ setosa   Petal.Wi~     4.76     5.01      0.246
#>  9 setosa  Sepal.L~ setosa   Sepal.Wi~     1.58     5.01      3.43 
#> 10 setosa  Sepal.W~ setosa   Petal.Le~     1.97     3.43      1.46 
#> # ... with 14 more rows, and 7 more variables: statistic <dbl>,
#> #   p.value <dbl>, parameter <dbl>, conf.low <dbl>, conf.high <dbl>,
#> #   method <chr>, alternative <chr>

Created on 2019-03-28 by the reprex package (v0.2.0).

courtiol · March 28, 2019, 1:49pm

Thanks @tbradley.
I don't see this as a real improvement aside the group_nest() replacement of group_by() %>% summarize().
Your proposition looks even more complex and implies to create things you delete.
Also it seems to break with recent versions of the packages:

library(broom) ## broom_0.5.1
library(tidyr) ## tidyr_0.8.3.9000
library(dplyr) ## dplyr_0.8.0.1
library(purrr) ## purrr_0.3.2  

iris %>% 
      filter(Species %in% c("setosa", "versicolor")) %>% 
      gather(variable, value, -Species) %>% 
      group_nest(Species, variable) %>% 
      crossing(., .)
#> Column name `.` must not be duplicated.
#> Use .name_repair to specify repair.

So my revised code is the following:

library(broom) ## broom_0.5.1
library(tidyr) ## tidyr_0.8.3.9000
library(dplyr) ## dplyr_0.8.0.1
library(purrr) ## purrr_0.3.2  

iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  pivot_longer(cols = -Species, names_to = "variable", values_to = "value") %>%
  group_nest(variable, Species, .key = "values") %>%
  pivot_wider(names_from = Species, values_from = values) %>%
  bind_cols(pmap_df(., ~ tidy(t.test(unlist(..2), unlist(..3))))) %>%
  rename(mean_setosa = estimate1, mean_versicolor = estimate2) -> res2

View(res2)

I would quite like to keep the intermediary output generated at the beginning, even if I don't like how to get there, since each row corresponds to a t-test:

iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  pivot_longer(cols = -Species, names_to = "variable", values_to = "value") %>%
  group_nest(variable, Species, .key = "values") %>%
  pivot_wider(names_from = Species, values_from = values)
#> # A tibble: 4 x 3
#>   variable     setosa            versicolor       
#>   <chr>        <list>            <list>           
#> 1 Petal.Length <tibble [50 × 1]> <tibble [50 × 1]>
#> 2 Petal.Width  <tibble [50 × 1]> <tibble [50 × 1]>
#> 3 Sepal.Length <tibble [50 × 1]> <tibble [50 × 1]>
#> 4 Sepal.Width  <tibble [50 × 1]> <tibble [50 × 1]>

aosmith · March 28, 2019, 2:22pm

I'm not sure I'm understanding the advantage of the list columns here based on your original example, although I think there are plenty of times they are useful.

In this case I might loop through each column of "traits" and use the species column for each one to filter correctly for the t test. Keeping this code all in a single pipe makes it harder to follow, I think, and I had to name arguments to subset the dataset in map2_dfr().

iris %>% 
    filter(Species %in% c("setosa", "versicolor") ) %>%
    map2_dfr(.x = select(., -Species), .y = list(.$Species), 
              .f = ~tidy(t.test(.x[.y == "setosa"], .x[.y == "versicolor"])),
              .id = "Traits")
# A tibble: 4 x 11
  Traits estimate estimate1 estimate2 statistic
  <chr>     <dbl>     <dbl>     <dbl>     <dbl>
1 Sepal~   -0.930     5.01       5.94    -10.5 
2 Sepal~    0.658     3.43       2.77      9.45
3 Petal~   -2.80      1.46       4.26    -39.5 
4 Petal~   -1.08      0.246      1.33    -34.1 
# ... with 6 more variables: p.value <dbl>,
#   parameter <dbl>, conf.low <dbl>, conf.high <dbl>,
#   method <chr>, alternative <chr>

tbradley · March 28, 2019, 2:46pm

My mistake on misreading your intent on what you were comparing. If you only want to compare the two groups than you don't need to use crossing (not sure why it doesn't work, as I am, again, not using the devel version yet). But I am not sure why you want to use bind_cols along with what seems to be a more complicated map statement, in my opinion.

As for keeping you intermediate data, just don't include .drop = TRUE in the unnest call and that will give you what you want.

library(tidyr)
library(ggplot2)
library(broom)
library(purrr)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.5.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

iris %>% 
  filter(Species %in% c("setosa", "versicolor")) %>% 
  gather(variable, value, -Species) %>% 
  group_nest(Species, variable) %>% 
  spread(Species, data) %>% 
  mutate(t_test = map2(setosa, versicolor, ~{tidy(t.test(.x$value, .y$value))})) %>% 
  unnest(t_test)
#> # A tibble: 4 x 13
#>   variable setosa versicolor estimate estimate1 estimate2 statistic
#>   <chr>    <list> <list>        <dbl>     <dbl>     <dbl>     <dbl>
#> 1 Petal.L~ <tibb~ <tibble [~   -2.80      1.46       4.26    -39.5 
#> 2 Petal.W~ <tibb~ <tibble [~   -1.08      0.246      1.33    -34.1 
#> 3 Sepal.L~ <tibb~ <tibble [~   -0.930     5.01       5.94    -10.5 
#> 4 Sepal.W~ <tibb~ <tibble [~    0.658     3.43       2.77      9.45
#> # ... with 6 more variables: p.value <dbl>, parameter <dbl>,
#> #   conf.low <dbl>, conf.high <dbl>, method <chr>, alternative <chr>

Created on 2019-03-28 by the reprex package (v0.2.0).

I think this kind of thing, a single model type applied for different groups fits exactly into the workflow of list columns.

aosmith · March 28, 2019, 2:51pm

Yes, you're right, and I agree. I think it was all the work reshaping things to force the list column approach that I don't understand compared to the original solution via lapply(). But if you're going to go this route I think your map2() solution is exactly what's needed.

courtiol · March 28, 2019, 2:54pm

Yes I agree @aosmith , @tbradley nailed exactly what I wanted.
I guess I will have to accept the call to gather() %>% ... %>% spread() that seems unavoidable to take this route.
Many thanks to both of you!

Here is the final outcome with the pivot_*():

iris %>% 
  filter(Species %in% c("setosa", "versicolor")) %>% 
  pivot_longer(cols = -Species, names_to = "variable", values_to = "value") %>%
  group_nest(Species, variable, .key = "values") %>% 
  pivot_wider(names_from = Species, values_from = values) %>%
  mutate(t_test = map2(setosa, versicolor, ~ tidy(t.test(.x$value, .y$value)))) %>% 
  unnest(t_test)

aosmith · March 28, 2019, 3:25pm

So you don't actually need to to the second reshaping step if you use the formula method of t.test() instead of the default "two numeric vector" method.

iris %>%
    filter(Species %in% c("setosa", "versicolor")) %>%
    pivot_longer(cols = -Species, names_to = "variable", values_to = "value") %>%
    group_nest(variable) %>%
    mutate(t.test = map(data, ~tidy(t.test(value ~ Species, data = .x)))) %>%
    unnest(t.test)

# A tibble: 4 x 12
  variable data  estimate estimate1 estimate2 statistic  p.value
  <chr>    <lis>    <dbl>     <dbl>     <dbl>     <dbl>    <dbl>
1 Petal.L~ <tib~   -2.80      1.46       4.26    -39.5  9.93e-46
2 Petal.W~ <tib~   -1.08      0.246      1.33    -34.1  2.72e-47
3 Sepal.L~ <tib~   -0.930     5.01       5.94    -10.5  3.75e-17
4 Sepal.W~ <tib~    0.658     3.43       2.77      9.45 2.48e-15
# ... with 5 more variables: parameter <dbl>, conf.low <dbl>,
#   conf.high <dbl>, method <chr>, alternative <chr>

system · April 18, 2019, 3:25pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.