First post, so forgive me if I break a few rules or if this is not the clearest example.
I have a few large datasets (~3.6 million rows each) which I am looking to subset based upon a few conditions.
The data has 3 columns we are interested in: X (numeric, a stationary time series), Activity (logical), and SignalStrength (numeric, between 0 and 100). The column X occasionally contains NA values; the other two do not.
I was looking for ways to find all sub-series within X that are of a specific length (in my case 600,000 rows) that do not contain NA values.
I looked into na.contiguous; however, this only gives me the longest such series, not all of them. The other method I had been thinking about was some form of "rolling window" approach?
The reason I want to extract multiple series is that I will be comparing each of these series to find the ones which have the lowest number of TRUE values for 'Activity' and the highest average 'SignalStrength'.
I'm comfortable with the last two filtering functions, but am struggling with the initial subsetting based upon NAs and length.
Any help would be greatly appreciated. Here is a glimpse of what the data looks like:
My suggestion is to excise V1 to get a vector of num and NA, transform it into a logical (so that we just get TRUE/FALSE values for whether there's any data), and then use rle (run length encoding).
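A minimal sketch of that idea on a made-up toy vector x (the names here are just for illustration):

x <- c(1, 2, NA, 4, 5, 6, NA, 8, 9, 10, 11)
has_data <- !is.na(x)   #TRUE where there is data, FALSE at the NAs
runs <- rle(has_data)   #runs of consecutive TRUE/FALSE values and their lengths
#here runs$lengths is 2 1 3 1 4 and runs$values is TRUE FALSE TRUE FALSE TRUE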
This is awesomely simple and just what I needed! Thanks as well for taking the time to show how to make the example reproducible.
Edit: the output from this is a list with lengths and values. Taking the maximum run length (max(runs$lengths)) would return the same type of result as na.contiguous from what I can tell, i.e. the longest sequence of non-NA values.
What I am looking for, though, is ALL the sequences with no NAs in them (so that I can compare between them), probably as indices.
So far as I can tell, the way to do this is to return the indices of the list items from the rle() function that pass a logical test (length >= x & value == TRUE). Then, sum the lengths of all the list entries that come before each of these indices and add 1 to get the start index of the sequence in the original dataframe.
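As a side note, a vectorized sketch of that summing step using cumsum, on a made-up toy vector x with a cutoff of 3 (both just for illustration):

x <- c(1, 2, NA, 4, 5, 6, NA, 8, 9, 10, 11)
runs <- rle(!is.na(x))
ends <- cumsum(runs$lengths)               #last index of each run within x
starts <- ends - runs$lengths + 1          #first index of each run within x
keep <- runs$values & runs$lengths >= 3    #runs of non-NA data at least 3 long
data.frame(start = starts[keep], length = runs$lengths[keep])
#gives start = 4, 8 and length = 3, 4 for this toy x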
Thought I'd share what I ended up doing for anyone who is working on a similar problem:
#convert the variable column to a logical vector based upon whether data is missing or present
V1_seqs <- df$V1
V1_seqs <- !is.na(V1_seqs)
#compute the runs of consecutive entries using the run length encoding function and turn the result into a dataframe
runs <- rle(V1_seqs)
runs <- as.data.frame(unclass(runs))
#return the indices of all items in the 'runs' dataframe which have the desired length and contain data (values==TRUE)
seqIndices <- which(runs$lengths >= 600000 & runs$values == TRUE)
#This gives us the indices of the items in the runs list, but not in the original dataframe next:
dfStartIndex <- c()
dfSeqLength <- c()
for (i in seqIndices) {
  #start index in the original dataframe = total length of all runs before run i, plus 1
  dfStartIndex[i] <- if (i > 1) sum(runs$lengths[1:(i - 1)]) + 1 else 1
  dfSeqLength[i] <- runs$lengths[i]
}
#bind the vectors into a dataframe of starting indices and sequence lengths; na.omit drops the NA rows left at positions of runs that did not qualify
dfExtracted <- na.omit(data.frame(dfStartIndex,dfSeqLength))
From there, the indices and lengths can be used to subset the original dataframe and extract the sequences.
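For example, a rough sketch of the comparison described above (column names Activity and SignalStrength are taken from the question; seqSummaries is just an illustrative name):

seqSummaries <- do.call(rbind, lapply(seq_len(nrow(dfExtracted)), function(i) {
  rows <- seq(dfExtracted$dfStartIndex[i], length.out = dfExtracted$dfSeqLength[i])
  data.frame(start    = dfExtracted$dfStartIndex[i],
             activity = sum(df$Activity[rows]),         #number of TRUE Activity rows
             signal   = mean(df$SignalStrength[rows]))  #average SignalStrength
}))
#the best candidate is then the sequence with the fewest TRUEs and the highest average signal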