I have a dataset in which students (`id`) take an exam multiple times and either pass (`1`) or fail (`0`) each time. The data looks something like this:

```
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)
my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL
head(my_data)
#       id results date_exam_taken exam_number
# 63018  1       0      2001-08-15           1
# 72324  1       1      2002-09-03           2
# 98866  1       0      2003-01-13           3
# 56137  1       1      2005-06-15           4
# 77746  1       0      2007-06-26           5
# 21438  1       0      2011-09-23           6
```

I then transformed the data into the following format:

```
library(tidyr)
my_data = my_data %>%
  pivot_wider(id_cols = id, names_from = "exam_number", values_from = "results")
# A tibble: 10,000 x 24
# id `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11` `12` `13` `14` `15` `16` `17` `18` `19` `20` `21` `22` `23`
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 0 1 0 1 0 0 0 1 0 1 NA NA NA NA NA NA NA NA NA NA NA NA NA
# 2 2 1 0 1 1 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 3 3 1 0 1 0 1 1 1 1 0 1 1 1 0 0 0 1 1 1 NA NA NA NA NA
# 4 4 1 1 0 0 0 1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 5 5 1 0 1 0 0 1 0 0 0 0 1 NA NA NA NA NA NA NA NA NA NA NA NA
# 6 6 1 1 0 1 1 0 0 1 0 0 1 NA NA NA NA NA NA NA NA NA NA NA NA
# 7 7 0 0 1 1 0 1 1 0 1 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
# 8 8 0 1 0 1 0 1 0 1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 9 9 0 0 0 0 0 0 1 1 0 1 0 NA NA NA NA NA NA NA NA NA NA NA NA
# 10 10 0 0 1 1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# ... with 9,990 more rows
```

Now, suppose I have the following sequences:

```
my_grid= expand.grid(0:1, 0:1, 0:1)
my_grid$sequence = paste("sequence", seq_len(nrow(my_grid)))
my_grid$seq = paste0(my_grid$Var1, my_grid$Var2, my_grid$Var3)
my_grid
#   Var1 Var2 Var3   sequence seq
# 1    0    0    0 sequence 1 000
# 2    1    0    0 sequence 2 100
# 3    0    1    0 sequence 3 010
# 4    1    1    0 sequence 4 110
# 5    0    0    1 sequence 5 001
# 6    1    0    1 sequence 6 101
# 7    0    1    1 sequence 7 011
# 8    1    1    1 sequence 8 111
```

**GOAL: Within the entire dataset, I want to find out the number of times each sequence appears (at the row level). For example, given that a student in this population failed two consecutive tests (e.g. failed tests 4 & 5, or failed tests 1 & 2), what is the probability that such a student will also fail the next test?**

I tried to approach this problem as follows: I took the exam scores of each student, concatenated them into a single string, and stored this in a new column. This should make it easier to recognize a desired pattern:

```
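# Concatenate each student's results into one string
# (exams that were never taken show up as the text "NA")
my_list = list()
for (i in seq_len(nrow(my_data)))
{
  val_i = paste(my_data[i, -1], collapse = "")
  print(val_i)
  my_list[[i]] = val_i
}
my_data$cols <- my_list

# Count the matches of one sequence in the concatenated strings
my_fun <- function(seq, data){
  return(lengths(gregexpr(seq, data)))
}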
```

**PROBLEM: Then, I tried to apply this function to obtain the final counts - but I am getting this error:**

```
#PROBLEM
my_grid$counts = mapply(my_fun, c(my_grid$seq), my_data$cols)
# Error in input[i, ] : incorrect number of dimensions
```
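One thing that may be relevant here: `mapply()` walks its vector arguments in parallel (first sequence with first student, second with second, and so on, recycling the shorter one), so it does not produce one value per sequence across all students. A minimal sketch of the difference, using a hypothetical toy vector `cols` in place of `my_data$cols` and counting only how many students contain each sequence:

```r
# Toy stand-in for my_data$cols (hypothetical values, for illustration only)
cols <- c("1100100", "1001110", "000")
seqs <- c("000", "100", "110")

# mapply() pairs elements by position: seqs[1] with cols[1], seqs[2] with cols[2], ...
paired <- mapply(function(s, x) grepl(s, x, fixed = TRUE), seqs, cols)

# sapply() over the sequences checks each sequence against every student
# and returns exactly one value per sequence
students_with_seq <- sapply(seqs, function(s) sum(grepl(s, cols, fixed = TRUE)))
```

Here `paired` has one entry per (sequence, student) pair, whereas `students_with_seq` has one entry per sequence, which is the shape needed for a `counts` column in `my_grid`.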

Ideally, I am looking for the final result to look something like this (from here, I could simply calculate the conditional probabilities):

```
# FINAL RESULT
Var1 Var2 Var3 sequence seq counts
1 0 0 0 sequence 1 000 ...
2 1 0 0 sequence 2 100 ...
3 0 1 0 sequence 3 010 ...
4 1 1 0 sequence 4 110 ...
5 0 0 1 sequence 5 001 ...
6 1 0 1 sequence 6 101 ...
7 0 1 1 sequence 7 011 ...
8 1 1 1 sequence 8 111 ...
```
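For reference, this is how the conditional probability could be derived from such a table. It is only a sketch: it assumes each `seq` string is read left to right as three consecutive exams, and the helper `cond_prob` and the counts below are made up for illustration (they are not results from the data).

```r
# Hypothetical counts, for illustration only
counts <- c("000" = 4, "001" = 6)

# P(next = outcome | previous two = prefix)
#   = count(prefix + outcome) / (count(prefix + "0") + count(prefix + "1"))
cond_prob <- function(counts, prefix, outcome) {
  num <- counts[[paste0(prefix, outcome)]]
  den <- counts[[paste0(prefix, "0")]] + counts[[paste0(prefix, "1")]]
  num / den
}

# Probability of failing again after two consecutive fails
p_fail_after_two_fails <- cond_prob(counts, "00", "0")
```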

**QUESTION: Can someone please show me what I am doing wrong and what I can do to fix this?**

Thanks!

- NOTE 1: Instead of using a function, I tried to do this with a for loop.

Here is the code I wrote:

```
for (i in seq_along(my_grid$seq))
{
  seq_i = my_grid$seq[i]
  val_i = sum(lengths(gregexpr(seq_i, my_data$cols)))
  print(c(i, seq_i, val_i))
}
# [1] "1"     "000"   "11255"
# [1] "2"     "100"   "12743"
# [1] "3"     "010"   "12145"
# [1] "4"     "110"   "12676"
# [1] "5"     "001"   "12765"
# [1] "6"     "101"   "12085"
# [1] "7"     "011"   "12672"
# [1] "8"     "111"   "11201"
```

But I don't think this is correct: the counts look rather high.
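Two properties of `gregexpr()` may explain part of this. When there is no match it returns a single `-1`, which `lengths()` still counts as 1, and it only reports non-overlapping matches. A minimal sketch of a counter that handles both, using a lookahead with `perl = TRUE`; the helper name `count_overlapping` and the toy strings are hypothetical:

```r
count_overlapping <- function(pattern, strings) {
  # A lookahead matches at every start position without consuming characters,
  # so overlapping occurrences are all reported
  hits <- gregexpr(paste0("(?=", pattern, ")"), strings, perl = TRUE)
  # "No match" comes back as a single -1, so count only positive positions
  per_string <- vapply(hits, function(h) sum(h > 0), integer(1))
  sum(per_string)
}

toy <- c("0000", "1111", "0100")  # hypothetical students, for illustration only

naive_count <- sum(lengths(gregexpr("000", toy)))  # counts the -1 entries as matches
fixed_count <- count_overlapping("000", toy)       # counts real, overlapping matches
```

In this toy example `naive_count` is 3 even though only the first string contains `000`, while `fixed_count` is 2 because `0000` holds two overlapping occurrences.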

- NOTE 2: I am also trying to make sure that the conditional probabilities are calculated using individual students' scores and not by "clumping" all student scores together.

E.g.

```
student 1 = 1,1,0,0,1,0,0
student 2 = 1,0,0,1,1,1,0
```

It would be **incorrect** to combine the scores of both of these students into a single string `"1,1,0,0,1,0,0, 1,0,0,1,1,1,0"` and then calculate the frequency counts. I would like to calculate these counts at the student level and then add them up.
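To make the difference concrete, here is a small sketch using the two students above and the sequence `0,0,1`. The helper `count_windows` is hypothetical; it slides a window over one numeric vector and counts exact matches.

```r
# Count how often `pattern` occurs as a run of consecutive values in `x`
count_windows <- function(x, pattern) {
  k <- length(pattern)
  if (length(x) < k) return(0L)
  starts <- seq_len(length(x) - k + 1)
  sum(vapply(starts, function(s) all(x[s:(s + k - 1)] == pattern), logical(1)))
}

students <- list(
  student1 = c(1, 1, 0, 0, 1, 0, 0),
  student2 = c(1, 0, 0, 1, 1, 1, 0)
)

# Per-student counts, then summed (the intended calculation)
per_student <- sapply(students, count_windows, pattern = c(0, 0, 1))
total <- sum(per_student)

# Clumping both students into one vector creates a match across the boundary
clumped <- count_windows(unlist(students), c(0, 0, 1))
```

Each student contains `0,0,1` once, so `total` is 2, while `clumped` is 3 because the last two results of student 1 and the first result of student 2 form a spurious match.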