First post, so forgive me if I break a few rules or if this is not the clearest example.
I have a few large datasets (~3.6 million rows each) which I am looking to subset based upon a few conditions.
The data has 3 columns we are interested in: X (numeric, a stationary time series), Activity (logical), and SignalStrength (numeric, between 0 and 100). The column X occasionally contains NA values; the other two do not.
I was looking for ways to find all sub-series within X that are of a specific length (in my case 600,000 rows) that do not contain NA values.
I looked into na.contiguous; however, this only gives me the longest such series, not all of them. The other method I had been thinking about was some form of "rolling window" approach?
The reason I want to extract multiple series is that I will be comparing each of these series to find the ones which have the lowest number of TRUE values for 'Activity' and the highest average 'SignalStrength'.
I'm comfortable with the last two filtering functions, but am struggling with the initial subsetting based upon NAs and length.
Any help would be greatly appreciated. Here is a glimpse of what the data looks like:
My suggestion is to excise V1 to get a vector of num and NA, transform it into a logical (so that we just get TRUE/FALSE values for whether there's any data), and then use rle (run length encoding).
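A minimal sketch of that idea on a made-up toy vector x (the names here are just for illustration):

x <- c(1, 2, NA, 4, 5, 6, NA, 8, 9, 10, 11)
has_data <- !is.na(x)   #TRUE where there is data, FALSE at the NAs
runs <- rle(has_data)   #runs of consecutive TRUE/FALSE values and their lengths
#here runs$lengths is 2 1 3 1 4 and runs$values is TRUE FALSE TRUE FALSE TRUE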
This is awesomely simple and just what I needed! Thanks as well for taking the time to show how to make the example reproducible.
Edit: the output from this is a list with lengths and values. Taking the maximum run length (max(runs$lengths)) would return the same type of result as na.contiguous from what I can tell, i.e. the longest sequence of non-NA values.
What I am looking for, though, is ALL the sequences with no NAs in them (so that I can compare between them), probably as indices.
So far as I can tell, the way to do this is to return the indices of the list items from the rle() function that pass a logical test (length >= x & value == TRUE). Then, sum the lengths of all the list entries that come before each of these indices and add 1 to get the start index of the sequence in the original dataframe.
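As a side note, a vectorized sketch of that summing step using cumsum, on a made-up toy vector x with a cutoff of 3 (both just for illustration):

x <- c(1, 2, NA, 4, 5, 6, NA, 8, 9, 10, 11)
runs <- rle(!is.na(x))
ends <- cumsum(runs$lengths)               #last index of each run within x
starts <- ends - runs$lengths + 1          #first index of each run within x
keep <- runs$values & runs$lengths >= 3    #runs of non-NA data at least 3 long
data.frame(start = starts[keep], length = runs$lengths[keep])
#gives start = 4, 8 and length = 3, 4 for this toy x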
Thought I'd share what I ended up doing for anyone who is working on a similar problem:
#convert the variable column to a logical vector based upon whether data is missing or present
V1_seqs <- df$V1
V1_seqs <- !is.na(V1_seqs)
#compute the runs of consecutive entries using the run length encoding function and turn the result into a dataframe
runs <- rle(V1_seqs)
runs <- as.data.frame(unclass(runs))
#return the indices of all items in the 'runs' dataframe which have the desired length and contain data (values==TRUE)
seqIndices <- which(runs$lengths >= 600000 & runs$values == TRUE)
#This gives us the indices of the items in the runs list, but not in the original dataframe next:
dfStartIndex <- c()
dfSeqLength <- c()
for (i in seqIndices) {
  #start index in the original dataframe = total length of all runs before run i, plus 1
  dfStartIndex[i] <- if (i > 1) sum(runs$lengths[1:(i - 1)]) + 1 else 1
  dfSeqLength[i] <- runs$lengths[i]
}
#bind the vectors into a dataframe of starting indices and sequence lengths; na.omit drops the NA rows left at positions of runs that did not qualify
dfExtracted <- na.omit(data.frame(dfStartIndex,dfSeqLength))
From there, the indices and lengths can be used to subset the original dataframe and extract the sequences.
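For example, a rough sketch of the comparison described above (column names Activity and SignalStrength are taken from the question; seqSummaries is just an illustrative name):

seqSummaries <- do.call(rbind, lapply(seq_len(nrow(dfExtracted)), function(i) {
  rows <- seq(dfExtracted$dfStartIndex[i], length.out = dfExtracted$dfSeqLength[i])
  data.frame(start    = dfExtracted$dfStartIndex[i],
             activity = sum(df$Activity[rows]),         #number of TRUE Activity rows
             signal   = mean(df$SignalStrength[rows]))  #average SignalStrength
}))
#the best candidate is then the sequence with the fewest TRUEs and the highest average signal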