Conditional Probability with dplyr

dsollberger · February 12, 2018, 7:51am

For an introduction to probability, I am experimenting with using dplyr (well, tidyverse) to connect programming concepts to the idea of conditional probability. In my code below, I am using mutate to store numbers that I need later (simply the "numerator" and the "denominator"). My query is this: does anyone have a cleaner way of doing this calculation?

Example: Compute the probability that a randomly selected passenger on the Titanic was female given that the passenger was at least 35 years old.

library("tidyverse") #for data wrangling tools
library("titanic")
tdf <- titanic_train #training set of Titanic data

conditional_probability <- tdf %>%
  filter(Age >= 35) %>%
  mutate(denominator = n()) %>%
  filter(Sex == "female") %>%
  mutate(numerator = n()) %>%
  summarize(unique(numerator/denominator))

jlacko · February 12, 2018, 1:01pm

This is a verbose but not necessarily wrong approach. In fact, it might be a good idea for students who are familiar with the concept of probability but are still learning their way around R / dplyr / tidyverse.

I will be interested in other opinions.

tbradley · February 12, 2018, 1:35pm

You can use sum with your summarize call to do this all in one step:

tdf %>%
  summarize(prob = sum(Age >= 35 & Sex == "female", na.rm = TRUE)/sum(Age >= 35, na.rm = TRUE))
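To make the numerator and denominator visible, here is a minimal sketch of the same one-step idea on a small made-up data frame (the `toy` data and the column names `n_both`, `n_given` are hypothetical, not part of the Titanic set). It assumes only that dplyr is installed. Rows with a missing Age drop out of both counts because of `na.rm = TRUE`.

```r
library(dplyr)

# Toy data, invented for illustration only (not the Titanic data)
toy <- tibble(
  Age = c(40, 52, 36, 20, NA, 61),
  Sex = c("female", "male", "female", "female", "male", "male")
)

result <- toy %>%
  summarize(
    n_both  = sum(Age >= 35 & Sex == "female", na.rm = TRUE), # numerator: both conditions hold
    n_given = sum(Age >= 35, na.rm = TRUE),                    # denominator: the conditioning event
    prob    = n_both / n_given                                 # P(female | Age >= 35)
  )

print(result)
```

Keeping the two counts as their own columns is optional, but it lets students check the ratio by hand.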

mine · February 13, 2018, 3:02am

I was just teaching conditional probabilities today! I chose the following method:

library(tidyverse)
library(titanic)

titanic_train %>%
  filter(
    !is.na(Sex), 
    !is.na(Age)
  ) %>%
  mutate(age_cat = ifelse(Age >= 35, "at least 35", "less than 35")) %>%
  count(age_cat, Sex) %>%
  group_by(age_cat) %>%
  mutate(prop = n / sum(n))
#> # A tibble: 4 x 4
#> # Groups:   age_cat [2]
#>   age_cat      Sex        n  prop
#>   <chr>        <chr>  <int> <dbl>
#> 1 at least 35  female    81 0.345
#> 2 at least 35  male     154 0.655
#> 3 less than 35 female   180 0.376
#> 4 less than 35 male     299 0.624

However, if you're also introducing tidyr, the following is another good way of going about it:

library(tidyverse)
library(titanic)

titanic_train %>%
  filter(
    !is.na(Sex), 
    !is.na(Age)
  ) %>%
  mutate(age_cat = ifelse(Age >= 35, "at least 35", "less than 35")) %>%
  count(age_cat, Sex) %>%
  spread(Sex, n) %>%
  mutate(prop = female / (female + male))
#> # A tibble: 2 x 4
#>   age_cat      female  male  prop
#>   <chr>         <int> <int> <dbl>
#> 1 at least 35      81   154 0.345
#> 2 less than 35    180   299 0.376

adamgruer · February 13, 2018, 4:15am

Something I've just been learning about in a DataCamp course is that you can take the mean of a logical vector to calculate the proportion of a particular case.

library(tidyverse)
library(titanic)
tdf <- titanic_train #training set of Titanic data

tdf %>%
  filter(Age >= 35) %>%
  summarize(prob = mean(Sex == "female", na.rm = TRUE))
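As a quick check of why this works, here is a small sketch on made-up data (the `toy` data frame and the names `rows`, `by_mean`, `by_count` are hypothetical). TRUE counts as 1 and FALSE as 0, so the mean of a logical vector should match the count of TRUE values divided by the number of rows. Note that `filter(Age >= 35)` already drops rows where Age is NA.

```r
library(dplyr)

# Toy data, invented for illustration only (not the Titanic data)
toy <- tibble(
  Age = c(40, 52, 36, 20, NA, 61),
  Sex = c("female", "male", "female", "female", "male", "male")
)

result <- toy %>%
  filter(Age >= 35) %>%                         # condition on Age >= 35 (NA ages are dropped)
  summarize(
    rows     = n(),                             # size of the conditioning group
    by_mean  = mean(Sex == "female"),           # proportion via mean of a logical vector
    by_count = sum(Sex == "female") / n()       # same proportion via explicit counts
  )

print(result)
```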

adamgruer · February 13, 2018, 4:26am

One other cool thing you can do is group by not just a variable but also an expression using that variable, such as Age >= 35:

library(tidyverse)
library(titanic)
tdf <- titanic_train #training set of Titanic data

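tdf %>%
  group_by(Age >= 35) %>%
  select(Sex) %>% # select() keeps the grouping column, so table() sees both columns
  table() %>%
  prop.table(1) # row proportions, i.e. the distribution of Sex within each age group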
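Grouping by an expression can also be combined with the mean-of-a-logical trick to stay inside dplyr. The following is only a sketch on made-up data (the `toy` data frame and the names `older` and `prob_female` are hypothetical); it removes missing ages first so that NA does not become a group of its own.

```r
library(dplyr)

# Toy data, invented for illustration only (not the Titanic data)
toy <- tibble(
  Age = c(40, 52, 36, 20, NA, 61),
  Sex = c("female", "male", "female", "female", "male", "male")
)

result <- toy %>%
  filter(!is.na(Age)) %>%                          # drop rows with unknown age
  group_by(older = Age >= 35) %>%                  # group by an expression, named `older`
  summarize(prob_female = mean(Sex == "female"))   # P(female | group) for each group

print(result)
```

This gives one row per age group, so both conditional probabilities come out of a single pipeline.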