This is a problem involving run-length encoding—what is the length of occurrence of a value in a vector?
suppressPackageStartupMessages({
library(dplyr)
})
get_diff_dist <- function(x) make_runs(x)[2,3] - (make_runs(x)[1,3] + make_runs(x)[1,1])
get_dist <- function(x) make_runs(x)[2,3] - make_runs(x)[2,1] - make_runs(x)[1,3]
make_runs <- function(x) { tibble(lengths = rle(x)$lengths, values = rle(x)$values) %>%
mutate(indices = cumsum(lengths)) %>%
filter(lengths > 1 & values == 0)
}
# avoid df, data and other names that are also names of functions;
# some operations will give precedence to the built-in resulting
# in treating it as a closure and throwing an error
DF <- data.frame(
ID = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J" ),
Gender = c("M","M","F","M","F","F","F","M","F","F"),
y_0101 = c(0,2,0,0,0,3,0,0,0,3),
y_0102 = c(6,2,0,0,2,0,2,0,6,0),
y_0103 = c(0,0,1,0,0,0,0,4,0,0),
y_0104 = c(0,0,0,2,0,3,0,2,0,2),
y_0105 = c(2,2,1,4,5,3,4,0,5,2),
y_0106 = c(2,2,1,4,5,5,6,4,3,0),
y_0107 = c(2,2,1,4,0,0,6,0,0,4),
y_0108 = c(0,0,0,1,0,0,7,0,2,8),
y_0109 = c(2,8,0,0,0,0,0,3,0,0),
y_0110 = c(2,0,0,2,2,0,0,0,0,4),
y_0111 = c(0,0,0,0,0,0,0,0,0,0),
y_0112 = c(0,0,0,0,0,2,0,0,0,0))
m <- DF %>% select(where(is.numeric)) %>% as.matrix()
DIFF <- list()
for (i in 1:10) DIFF[i] = get_dist(m[i,])
DIFF <- unlist(DIFF)
tibble::add_column(DF,DIFF)
#> ID Gender y_0101 y_0102 y_0103 y_0104 y_0105 y_0106 y_0107 y_0108 y_0109
#> 1 A M 0 6 0 0 2 2 2 0 2
#> 2 B M 2 2 0 0 2 2 2 0 8
#> 3 C F 0 0 1 0 1 1 1 0 0
#> 4 D M 0 0 0 2 4 4 4 1 0
#> 5 E F 0 2 0 0 5 5 0 0 0
#> 6 F F 3 0 0 3 3 5 0 0 0
#> 7 G F 0 2 0 0 4 6 6 7 0
#> 8 H M 0 0 4 2 0 4 0 0 3
#> 9 I F 0 6 0 0 5 3 0 2 0
#> 10 J F 3 0 0 2 2 0 4 8 0
#> y_0110 y_0111 y_0112 DIFF
#> 1 2 0 0 6
#> 2 0 0 0 5
#> 3 0 0 0 5
#> 4 2 0 0 7
#> 5 2 0 0 2
#> 6 0 0 2 3
#> 7 0 0 0 4
#> 8 0 0 0 4
#> 9 0 0 0 4
#> 10 4 0 0 7
OP <- c(6,5,5,7,2,3,4,4,4,7)
OP == DIFF
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The last line compares the DIFF column in the image posted with the results of the preceding code, and shows no difference.
Every R problem can be thought of with advantage as the interaction of three objects— an existing object, x , a desired object,y , and a function, f, that will return a value of y given x as an argument. In other words, school algebra— f(x) = y. Any of the objects can be composites.
In this case, x is withinDF and y is OP. f is composite. Given OP, it is possible add it as a column to DF, but that is trivial, as shown in the code.
x has first to be extracted from DF, which is the office of
m <- DF %>% select(where(is.numeric)) %>% as.matrix()
m is of dimension 10, 12 and we will need some function that will treat each row separately, which is
for (i in 1:10) DIFF[i] = get_dist(m[i,])
DIFF here begins as an empty list and operates as a receiver object to accumulate the results of the application of the get_dist function.
get_dist is a composite function that extracts specific elements from the return value of make_runs, which returns an object in the form
> make_runs(m[1,])
# A tibble: 2 x 3
lengths values indices
<int> <dbl> <int>
1 2 0 4
2 2 0 12
through construction initially of a tibble containing return values of the rle function, which looks like
> m[1,] -> a
> attributes(a) <- NULL
> a
[1] 0 6 0 0 2 2 2 0 2 2 0 0
> rle(a)
Run Length Encoding
lengths: int [1:7] 1 1 2 3 1 2 2
values : num [1:7] 0 6 0 2 0 2 0
lengths is how many times the corresponding values is repeated. From there,
mutate(indices = cumsum(lengths))
provides the index of the end of each run, and
filter(lengths > 1 & values == 0)
narrows the results to runs of 2 or more values of 0.
Returning to get_dist, it was necessary to examine the image infer that the quantity desired was the number of columns desired was exclusive of the ending column of the first cohort and the beginning column of the last. The logic to derived DIFF2 was not similarly apparent.