Conditional Probabilities in R

omario · January 26, 2023, 5:21pm

I have the following dataset:

my_data = structure(list(Sequence = structure(1:8, .Label = c("HTT", "TTH", 
"HHH", "HHT", "HTH", "THH", "TTT", "THT"), class = "factor"), 
    sums = c(93L, 93L, 112L, 106L, 108L, 97L, 94L, 97L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -8L))


> my_data
# A tibble: 8 x 2
  Sequence  sums
  <fct>    <int>
1 HTT         93
2 TTH         93
3 HHH        112
4 HHT        106
5 HTH        108
6 THH         97
7 TTT         94
8 THT         97

Using the counts in the sums column, I want to find the probability that the third flip is "H" or "T", conditional on the first two flips (e.g. H given HH, H given TH, T given TT, etc.).
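
To make the target concrete, here is a minimal sketch for a single case, P(third flip is H | first two flips are HH), computed by hand from the counts above. The variable names n_hhh, n_hht and p_h_given_hh are only illustrative and do not come from the dataset.

```r
# Counts from the sums column for the two sequences that start with "HH"
n_hhh <- 112  # HHH
n_hht <- 106  # HHT

# P(H | HH) = count(HHH) / (count(HHH) + count(HHT))
p_h_given_hh <- n_hhh / (n_hhh + n_hht)
round(p_h_given_hh, 3)
#> [1] 0.514
```

The same ratio, taken within each pair of opening flips, is what the grouped calculation should reproduce.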

I tried to do this with the dplyr package:

library(dplyr)

my_data %>%
    mutate(two_seq = substr(Sequence, 1, 2)) %>%
    group_by(two_seq) %>%
    mutate(third = substr(Sequence, 3, 3)) %>%
    group_by(two_seq, third) %>%
    summarize(sums = sum(sums)) %>%
    mutate(prob = sums / sum(sums))

Here is the output of my code:

`summarise()` has grouped output by 'two_seq'. You can override using the `.groups` argument.
# A tibble: 8 x 4
# Groups:   two_seq [4]
  two_seq third  sums  prob
  <chr>   <chr> <int> <dbl>
1 HH      H       112 0.514
2 HH      T       106 0.486
3 HT      H       108 0.537
4 HT      T        93 0.463
5 TH      H        97 0.5  
6 TH      T        97 0.5  
7 TT      H        93 0.497
8 TT      T        94 0.503

Can someone please tell me if I have done this correctly?

Thanks!

FJCC · January 26, 2023, 6:04pm

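You get the right answer, but the process has unnecessary steps, and it only works because summarize() drops just the last grouping level (third), leaving its output grouped by two_seq. That is why sum(sums) in the final mutate() is taken within each two_seq rather than over the whole table. Here is a comparison of your code and a simplified version.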

my_data = structure(
  list(
    Sequence = structure(
      1:8,
      .Label = c("HTT", "TTH", "HHH", "HHT", "HTH", "THH", "TTT", "THT"),
      class = "factor"
    ),
    sums = c(93L, 93L, 112L, 106L, 108L, 97L, 94L, 97L)
  ),
  class = c("tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -8L)
)
library(dplyr)

my_data
#> # A tibble: 8 × 2
#>   Sequence  sums
#>   <fct>    <int>
#> 1 HTT         93
#> 2 TTH         93
#> 3 HHH        112
#> 4 HHT        106
#> 5 HTH        108
#> 6 THH         97
#> 7 TTT         94
#> 8 THT         97

#Original code
my_data %>%
  mutate(two_seq = substr(Sequence, 1, 2)) %>%
  group_by(two_seq) %>%
  mutate(third = substr(Sequence, 3, 3)) %>%
  group_by(two_seq, third) %>%
  summarize(sums = sum(sums)) %>%
  mutate(prob = sums / sum(sums))
#> `summarise()` has grouped output by 'two_seq'. You can override using the
#> `.groups` argument.
#> # A tibble: 8 × 4
#> # Groups:   two_seq [4]
#>   two_seq third  sums  prob
#>   <chr>   <chr> <int> <dbl>
#> 1 HH      H       112 0.514
#> 2 HH      T       106 0.486
#> 3 HT      H       108 0.537
#> 4 HT      T        93 0.463
#> 5 TH      H        97 0.5  
#> 6 TH      T        97 0.5  
#> 7 TT      H        93 0.497
#> 8 TT      T        94 0.503

#Simplified version: the data already has one row per sequence, so summarize() is not needed
my_data %>%
  mutate(two_seq = substr(Sequence, 1, 2)) %>%
  #group_by(two_seq) %>%
  mutate(third = substr(Sequence, 3, 3)) %>%
  #group_by(two_seq, third) %>%
  #summarize(sums = sum(sums)) %>%
  group_by(two_seq) %>%
  mutate(prob = sums / sum(sums)) %>%
  arrange(two_seq) #This line makes comparing to the original result easier
#> # A tibble: 8 × 5
#> # Groups:   two_seq [4]
#>   Sequence  sums two_seq third  prob
#>   <fct>    <int> <chr>   <chr> <dbl>
#> 1 HHH        112 HH      H     0.514
#> 2 HHT        106 HH      T     0.486
#> 3 HTT         93 HT      T     0.463
#> 4 HTH        108 HT      H     0.537
#> 5 THH         97 TH      H     0.5  
#> 6 THT         97 TH      T     0.5  
#> 7 TTH         93 TT      H     0.497
#> 8 TTT         94 TT      T     0.503

Created on 2023-01-26 with reprex v2.0.2
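
Since the original pipeline depends on how summarize() groups its output, one way to make that dependence visible is to state it with the .groups argument. The following is only a sketch, assuming dplyr 1.0.0 or later (where .groups is available). It rebuilds the data with data.frame() for brevity, and the names cond_probs and total are illustrative.

```r
library(dplyr)

# Same counts as in the question, rebuilt as a plain data frame
my_data <- data.frame(
  Sequence = c("HTT", "TTH", "HHH", "HHT", "HTH", "THH", "TTT", "THT"),
  sums     = c(93L, 93L, 112L, 106L, 108L, 97L, 94L, 97L)
)

cond_probs <- my_data %>%
  mutate(
    two_seq = substr(Sequence, 1, 2),  # first two flips (the condition)
    third   = substr(Sequence, 3, 3)   # third flip (the outcome)
  ) %>%
  group_by(two_seq, third) %>%
  # "drop_last" keeps the result grouped by two_seq only; this is the
  # default behaviour here, written out so the next step's grouping is clear
  summarize(sums = sum(sums), .groups = "drop_last") %>%
  mutate(prob = sums / sum(sums)) %>%
  ungroup()

# Sanity check: probabilities should sum to 1 within each two_seq
cond_probs %>%
  group_by(two_seq) %>%
  summarize(total = sum(prob))
```

With .groups = "drop" instead, sum(sums) in the mutate() step would be the grand total of 800, so prob would be the joint probability of each three-flip sequence rather than the conditional probability of the third flip.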
