For an introduction to probability, I am experimenting with using dplyr (well, tidyverse) to connect programming concepts to the idea of conditional probability. In my code below, I am using mutate to store numbers that I need later (simply the "numerator" and the "denominator"). My query is this: does anyone have a cleaner way of doing this calculation?
Example: Compute the probability that a randomly selected passenger on the Titanic was female given that the passenger was at least 35 years old.
library("tidyverse") #for data wrangling tools
library("titanic")
tdf <- titanic_train #training set of Titanic data
conditional_probability <- tdf %>%
  filter(Age >= 35) %>%
  mutate(denominator = n()) %>%
  filter(Sex == "female") %>%
  mutate(numerator = n()) %>%
  summarize(unique(numerator/denominator))
This is a verbose but not necessarily wrong approach. In fact, it might be a good idea for students who are familiar with the concept of probability but are still learning their way around R / dplyr / tidyverse.
I was just teaching conditional probabilities today! I chose the following method:
library(tidyverse)
library(titanic)
titanic_train %>%
  filter(
    !is.na(Sex),
    !is.na(Age)
  ) %>%
  mutate(age_cat = ifelse(Age >= 35, "at least 35", "less than 35")) %>%
  count(age_cat, Sex) %>%
  group_by(age_cat) %>%
  mutate(prop = n / sum(n))
#> # A tibble: 4 x 4
#> # Groups:   age_cat [2]
#>   age_cat      Sex        n  prop
#>   <chr>        <chr>  <int> <dbl>
#> 1 at least 35  female    81 0.345
#> 2 at least 35  male     154 0.655
#> 3 less than 35 female   180 0.376
#> 4 less than 35 male     299 0.624
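To tie the table back to the definition, the first row can be read as a conditional probability. Here is a short worked version using the counts 81 and 154 from the output above. The symbols are my own shorthand, not anything from the code: F for "female", A for "Age at least 35", and N for the number of passengers with a recorded age, which cancels.

```latex
P(F \mid A) = \frac{P(F \cap A)}{P(A)}
            = \frac{81 / N}{(81 + 154) / N}
            = \frac{81}{235}
            \approx 0.345
```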
However, if you're also introducing tidyr, the following is another good way of going about it:
library(tidyverse)
library(titanic)
titanic_train %>%
  filter(
    !is.na(Sex),
    !is.na(Age)
  ) %>%
  mutate(age_cat = ifelse(Age >= 35, "at least 35", "less than 35")) %>%
  count(age_cat, Sex) %>%
  spread(Sex, n) %>%
  mutate(prop = female / (female + male))
#> # A tibble: 2 x 4
#>   age_cat      female  male  prop
#>   <chr>         <int> <int> <dbl>
#> 1 at least 35      81   154 0.345
#> 2 less than 35    180   299 0.376
Something I've just been learning about in a DataCamp course is that you can take the mean of a logical vector to calculate the proportion of cases where a condition is true.
library(tidyverse)
library(titanic)
tdf <- titanic_train #training set of Titanic data
tdf %>%
  filter(Age >= 35) %>% #rows with a missing Age are dropped here too
  summarize(prob = mean(Sex == "female", na.rm = TRUE))
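The reason the mean trick works is that R treats TRUE as 1 and FALSE as 0 in arithmetic, so the mean of a logical vector is the count of TRUE values divided by the number of values. Here is a minimal base-R sketch on a made-up vector (`is_female` is purely illustrative, not the Titanic data) that checks this against counting by hand:

```r
# Toy logical vector, invented for illustration (not taken from titanic_train)
is_female <- c(TRUE, FALSE, FALSE, TRUE, NA)

# Proportion by explicit counting: TRUEs divided by non-missing values
by_count <- sum(is_female, na.rm = TRUE) / sum(!is.na(is_female))

# Same proportion via the mean of the logical vector
by_mean <- mean(is_female, na.rm = TRUE)

print(c(by_count = by_count, by_mean = by_mean))
```

Both numbers should agree, which is why mean(Sex == "female") after filtering on Age gives the conditional proportion directly.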