Time Series Subsetting by NA and length

technocrat · November 1, 2019, 12:45am

Hi, and welcome!

A reproducible example, called a reprex, always attracts more answers, because it lets people focus on the problem without having to do the set-up themselves.

Yours can be very simple:

df <- read.csv("https://gist.githubusercontent.com/technocrat/07eb05cb69cf17a1e2ce7bd87a70f9c8/raw/672c03cdd68de2a9ed92702e21a31212c802fba6/runs.csv", header = FALSE)
df
#>        V1 V2   V3
#> 1   0.104  0 31.6
#> 2   0.083  0 31.6
#> 3   0.002  0 31.6
#> 4  -0.060  0 31.6
#> 5  -0.048  0 31.6
#> 6   0.002  0 31.6
#> 7   0.021  0 31.8
#> 8   0.002  0 31.8
#> 9  -0.010  0 31.8
#> 10  0.002  0 31.8
#> 11  0.016  0 31.8
#> 12  0.007  0 31.8
#> 13 -0.009  0 31.8
#> 14 -0.012  0 31.8
#> 15 -0.004  0 31.8
#> 16 -0.001  0 31.8
#> 17 -0.004  0 31.8
#> 18 -0.004  0 31.8
#> 19     NA  0 31.8
#> 20     NA  0 31.8
#> 21 -0.009  0 31.8
#> 22 -0.012  0 31.8
#> 23 -0.004  0 31.8
#> 24 -0.001  0 31.8
#> 25 -0.004  0 31.8
#> 26 -0.004  0 31.8
#> 27     NA  0 31.8
#> 28  0.002  0 31.8
#> 29  0.016  0 31.8
#> 30  0.007  0 31.8
#> 31 -0.009  0 31.8
#> 32 -0.012  0 31.8
#> 33 -0.004  0 31.8
#> 34 -0.001  0 31.8
#> 35 -0.004  0 31.8
#> 36 -0.004  0 31.8
#> 37     NA  0 31.8

Created on 2019-10-31 by the reprex package (v0.3.0)

My suggestion is to pull out V1 as a numeric vector containing NAs, convert it to a logical vector (TRUE where there is data, FALSE where it is NA), and then apply rle (run length encoding) to it.

Here's what that gets you:

df <- read.csv("https://gist.githubusercontent.com/technocrat/07eb05cb69cf17a1e2ce7bd87a70f9c8/raw/672c03cdd68de2a9ed92702e21a31212c802fba6/runs.csv", header = FALSE)
V1 <- df$V1
V1 <- !is.na(V1)
runs <- rle(V1)
runs
#> Run Length Encoding
#>   lengths: int [1:6] 18 2 6 1 9 1
#>   values : logi [1:6] TRUE FALSE TRUE FALSE TRUE FALSE

Created on 2019-10-31 by the reprex package (v0.3.0)

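Note that max(runs) won't work directly, because runs is a list with lengths and values components. With max(runs$lengths[runs$values]) you get the length of the longest run of non-NA values (18 here, the first run); then it's 'just' a matter of indexing, since cumsum(runs$lengths) gives the row at which each run ends.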
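The indexing step could be sketched roughly as follows. This is only one way to do it, not necessarily the best one; the short vector v1 is a made-up stand-in for df$V1, and the names starts, ends, longest and min_len are my own illustrative choices.

```r
# Hypothetical stand-in for df$V1; NA marks a gap in the series
v1 <- c(0.104, 0.083, NA, NA, -0.009, -0.012, -0.004, -0.001, NA, 0.002)

# TRUE where there is data, then run length encoding
runs <- rle(!is.na(v1))

# Position in v1 at which each run ends and starts
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Longest run of non-NA values: zero out the NA runs before which.max()
longest     <- which.max(ifelse(runs$values, runs$lengths, 0L))
longest_idx <- starts[longest]:ends[longest]
v1[longest_idx]

# Every non-NA run with at least min_len observations
# (min_len is an assumed threshold, pick whatever suits your data)
min_len  <- 2
keep     <- which(runs$values & runs$lengths >= min_len)
segments <- lapply(keep, function(i) v1[starts[i]:ends[i]])
```

With the real data you would start from df$V1 instead of v1 and subset the data frame with df[longest_idx, ]. Given the run lengths shown above (18 2 6 1 9 1), that should pick out rows 1 to 18.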