I have the following dataset:
my_data = structure(list(Sequence = structure(1:8, .Label = c("HTT", "TTH",
"HHH", "HHT", "HTH", "THH", "TTT", "THT"), class = "factor"),
sums = c(93L, 93L, 112L, 106L, 108L, 97L, 94L, 97L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -8L))
> my_data
# A tibble: 8 x 2
  Sequence  sums
  <fct>    <int>
1 HTT         93
2 TTH         93
3 HHH        112
4 HHT        106
5 HTH        108
6 THH         97
7 TTT         94
8 THT         97
Using the counts in the sums column, I want to estimate the probability that the third flip is "H" or "T", conditional on the first two flips (e.g. H given HH, H given TH, T given TT, etc.).
I tried to do this with the dplyr package:
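To make the target quantity concrete, here is a minimal hand calculation for a single case, P(third = H | first two = HH). It is only a sketch: the variable names n_HHH, n_HHT and p_H_given_HH are made up for illustration, and the counts are copied from the sums column above.

```r
# Counts copied from the sums column for the two sequences starting with "HH"
n_HHH <- 112
n_HHT <- 106

# P(third = H | first two = HH) = n(HHH) / (n(HHH) + n(HHT))
p_H_given_HH <- n_HHH / (n_HHH + n_HHT)
round(p_H_given_HH, 3)
```

The other seven conditional probabilities follow the same pattern: the count for the full three-flip sequence divided by the total count of the two sequences sharing its first two flips.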
library(dplyr)
my_data %>%
  mutate(two_seq = substr(Sequence, 1, 2)) %>%
  group_by(two_seq) %>%
  mutate(third = substr(Sequence, 3, 3)) %>%
  group_by(two_seq, third) %>%
  summarize(sums = sum(sums)) %>%
  mutate(prob = sums / sum(sums))
Here is the output of my code:
`summarise()` has grouped output by 'two_seq'. You can override using the `.groups` argument.
# A tibble: 8 x 4
# Groups: two_seq [4]
  two_seq third  sums  prob
  <chr>   <chr> <int> <dbl>
1 HH      H       112 0.514
2 HH      T       106 0.486
3 HT      H       108 0.537
4 HT      T        93 0.463
5 TH      H        97 0.5
6 TH      T        97 0.5
7 TT      H        93 0.497
8 TT      T        94 0.503
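One way to cross-check these numbers independently of dplyr is a base R contingency table normalised by row. This is a rough sketch rather than a definitive method: the names counts, first_two, tab and cond_prob are chosen here for illustration, and the data are retyped from my_data as a plain data frame.

```r
# Same counts as my_data, rebuilt as a plain data frame (no tibble needed)
counts <- data.frame(
  Sequence = c("HTT", "TTH", "HHH", "HHT", "HTH", "THH", "TTT", "THT"),
  sums     = c(93L, 93L, 112L, 106L, 108L, 97L, 94L, 97L),
  stringsAsFactors = FALSE
)

# Split each sequence into its first two flips and its third flip
counts$first_two <- substr(counts$Sequence, 1, 2)
counts$third     <- substr(counts$Sequence, 3, 3)

# Contingency table of counts: rows = first two flips, columns = third flip
tab <- xtabs(sums ~ first_two + third, data = counts)

# Divide each row by its row total to get P(third | first two)
cond_prob <- prop.table(tab, margin = 1)
round(cond_prob, 3)

# Each row is a conditional distribution, so it should sum to 1
stopifnot(all(abs(rowSums(cond_prob) - 1) < 1e-12))
```

If the rounded entries of cond_prob match the prob column above, the two approaches agree on the conditional probabilities.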
Can someone please tell me if I have done this correctly?
Thanks!