How to use `EmbedDimension()` of the rEDM package in the case of imputation of missing data in time series?

My goal is to fill data gaps in time series (fish count in different streams) using Empirical dynamic modelling (EDM) and the rEDM package. I can use the Simplex() function for that purpose (by iterating through each data gap sequentially) but I am struggling to use the EmbedDimension() to determine the optimal embedding dimension E because it is unclear how the column time is used: are the actual values (e.g. 2001, 2022, 2003, etc.) considered or are consecutive rows considered consecutive time steps? In other words, can I provide a dataset with missing time steps? And finally, what's the best approach here (those questions are asked again at the end of this message).

Here is a reproducible example:

library(rEDM)
library(dplyr)

set.seed(100)

# Create fish abundance data for 3 streams with some similar cyclic dynamics

years <- 1960:2025
n_years <- length(years)

# Generate base dynamics (shared environmental driver, which is not included in the analysis)
base_signal <- sin(2*pi*(1:length(years))/20) + rnorm(length(years), 0, 0.2) #  sin(2 * pi * (1:n_years) / 8) + 0.5 * sin(2 * pi * (1:n_years) / 4)
noise_level <- 0.9

# Create 3 correlated time series
stream_data <- data.frame(
  year = years,
  stream1 = 1000 + 500 * base_signal + rnorm(n_years, 0, noise_level * 200),
  stream2 = 800 + 400 * base_signal + rnorm(n_years, 0, noise_level * 150),
  stream3 = 1200 + 600 * base_signal + rnorm(n_years, 0, noise_level * 250)
)

# Introduce gaps in stream1
data_gap1 <- stream_data
data_gap1$stream1[c(5:7,20,23,47:55,60)] <- NA

data_gap1

Make a function that receives a vector, dataframe or matrix containing missing data and returns a data frame of two columns, where each row contains the first and last position or row of a data gap. In case of a gap made of a single missing value, position is repeated twice in the row:

get_gap_positions <- function(x, NA_focus = T){
  
  if(NA_focus){
    if(class(x) %in% c("data.frame","matrix")){
      positions <- which(is.na(rowSums(x)))
    }else{
      positions <- which(is.na(x))
    }
    colnames <- c("start_NA","end_NA")
  }else{
    if(class(x) %in% c("data.frame","matrix")){
      positions <- which(!is.na(rowSums(x)))
    }else{
      positions <- which(!is.na(x))
    }
    colnames <- c("start_data","end_data")
  }
  
  if(length(positions) == 0){
    return(data.frame(gap_start = integer(0), gap_end = integer(0)))
    
  }else{
    # Find where consecutive groups break
    breaks <- c(0, 
                which(diff(positions) > 1), # where gaps are
                length(positions))
    
    start <- positions[breaks[-length(breaks)] + 1]
    end   <- positions[breaks[-1]]
    
    out <- data.frame(start,end)
    
    colnames(out) <- colnames
    
    return(out)
  }
}

Create the matrix where each row is an data gap interval in stream1 (argument pred):

pred <- get_gap_positions(data_gap1$stream1) # function to return the rows of the extremities of a data gap

#' for row corresponding to a single NA, modify the first value by taking the previous
#' position because Simplex() needs a consecutive pair of positions.

cond <- pred$start_NA == pred$end_NA
pred$start_NA[cond] <- pred$end_NA[cond] - 1

pred

Create the matrix where each row is a data interval with available data (argument lib):

lib <- get_gap_positions(data_gap1[,setdiff(colnames(data_gap1),"year")], NA_focus = F)

cond <- lib$start_data == lib$end_data
lib$start_data[cond] <- lib$end_data[cond] - 1

lib

Here below is an illustration of how I use Simplex() to fill the 1st data gap in stream1 with E arbitrarily set to 3:

Simplex(
  dataFrame = data_gap1,
  lib       = lib,
  pred      = pred[1,],        
  target    = "stream1",
  columns   = "stream2 stream3",
  E         = 3,           
  Tp        = 0,            
  tau       = -1,          
  embedded  = FALSE )

I want to find the optimal E with EmbedDimension(), and here is the result with the first data interval:

EmbedDimension(
  dataFrame = data_gap1,
  lib       = lib,
  pred      = lib[1,],
  target    = "stream1",
  columns   = "stream2 stream3",
  maxE      = 4,  
  Tp        = 0,
  tau       = -1,
  showPlot  = TRUE)

Now we can also pass the entire lib, but that does not seem to be the way to go considering the WARNING message: "Disjoint prediction sets are not fully supported. Use with caution":

EmbedDimension(
  dataFrame = data_gap1,
  lib       = lib,
  pred      = lib,
  target    = "stream1",
  columns   = "stream2 stream3",
  maxE      = 4,  
  Tp        = 0,
  tau       = -1,
  showPlot  = TRUE)

Notice that the output is also different from using lib[1,] so it is unclear what is being considered for pred.

I tried removing the rows with NAs entirely and passing all the rows to pred, which produces a different result and no warning:

cond <- apply(data_gap1[,setdiff(colnames(data_gap1),"year")],1,function(r){all(!is.na(r))})
data <- data_gap1[cond,]

EmbedDimension(
  dataFrame = data,
  lib       = c(1,nrow(data)),         
  pred      = c(1,nrow(data)),     
  target    = "stream1",
  columns   = "stream2 stream3",
  maxE      = 4,
  Tp        = 0,
  tau       = -1,
  showPlot  = T)

But I was not sure if the column year was used to account for the missing years or not so I shuffled the row, and got a different result:

data_shuffle <- data[sample(x = 1:nrow(data),size = nrow(data),replace = F),]

EmbedDimension(
  dataFrame = data_shuffle,
  lib       = c(1,nrow(data_shuffle)),
  pred      = c(1,nrow(data_shuffle)),
  target    = "stream1",
  columns   = "stream2 stream3",
  maxE = 4,
  tau       = -1,
  Tp        = 0,           
  showPlot = T)

Question 1:

This suggests that the year values are not considered and instead that consecutive rows are considered consecutive years? Is that correct?

Question 2:

If that's the case then removing the rows/years with missing data is not the solution, correct?

Question 3:

If that is the case, I see two alternatives to estimate the optimal E in my case:

  1. estimate E for the longest data interval only
  2. estimate E for each data interval and then average the E values, weighting according to the relative number of data points in each interval

Make sense?

Thank you :slight_smile:

1 Like