My goal is to fill data gaps in time series (fish count in different streams) using Empirical dynamic modelling (EDM) and the rEDM package. I can use the Simplex() function for that purpose (by iterating through each data gap sequentially) but I am struggling to use the EmbedDimension() to determine the optimal embedding dimension E because it is unclear how the column time is used: are the actual values (e.g. 2001, 2022, 2003, etc.) considered or are consecutive rows considered consecutive time steps? In other words, can I provide a dataset with missing time steps? And finally, what's the best approach here (those questions are asked again at the end of this message).
Here is a reproducible example:
library(rEDM)
library(dplyr)
set.seed(100)
# Create fish abundance data for 3 streams with some similar cyclic dynamics
years <- 1960:2025
n_years <- length(years)
# Generate base dynamics (shared environmental driver, which is not included in the analysis)
base_signal <- sin(2*pi*(1:length(years))/20) + rnorm(length(years), 0, 0.2) # sin(2 * pi * (1:n_years) / 8) + 0.5 * sin(2 * pi * (1:n_years) / 4)
noise_level <- 0.9
# Create 3 correlated time series
stream_data <- data.frame(
year = years,
stream1 = 1000 + 500 * base_signal + rnorm(n_years, 0, noise_level * 200),
stream2 = 800 + 400 * base_signal + rnorm(n_years, 0, noise_level * 150),
stream3 = 1200 + 600 * base_signal + rnorm(n_years, 0, noise_level * 250)
)
# Introduce gaps in stream1
data_gap1 <- stream_data
data_gap1$stream1[c(5:7,20,23,47:55,60)] <- NA
data_gap1
Make a function that receives a vector, dataframe or matrix containing missing data and returns a data frame of two columns, where each row contains the first and last position or row of a data gap. In case of a gap made of a single missing value, position is repeated twice in the row:
get_gap_positions <- function(x, NA_focus = T){
if(NA_focus){
if(class(x) %in% c("data.frame","matrix")){
positions <- which(is.na(rowSums(x)))
}else{
positions <- which(is.na(x))
}
colnames <- c("start_NA","end_NA")
}else{
if(class(x) %in% c("data.frame","matrix")){
positions <- which(!is.na(rowSums(x)))
}else{
positions <- which(!is.na(x))
}
colnames <- c("start_data","end_data")
}
if(length(positions) == 0){
return(data.frame(gap_start = integer(0), gap_end = integer(0)))
}else{
# Find where consecutive groups break
breaks <- c(0,
which(diff(positions) > 1), # where gaps are
length(positions))
start <- positions[breaks[-length(breaks)] + 1]
end <- positions[breaks[-1]]
out <- data.frame(start,end)
colnames(out) <- colnames
return(out)
}
}
Create the matrix where each row is an data gap interval in stream1 (argument pred):
pred <- get_gap_positions(data_gap1$stream1) # function to return the rows of the extremities of a data gap
#' for row corresponding to a single NA, modify the first value by taking the previous
#' position because Simplex() needs a consecutive pair of positions.
cond <- pred$start_NA == pred$end_NA
pred$start_NA[cond] <- pred$end_NA[cond] - 1
pred
Create the matrix where each row is a data interval with available data (argument lib):
lib <- get_gap_positions(data_gap1[,setdiff(colnames(data_gap1),"year")], NA_focus = F)
cond <- lib$start_data == lib$end_data
lib$start_data[cond] <- lib$end_data[cond] - 1
lib
Here below is an illustration of how I use Simplex() to fill the 1st data gap in stream1 with E arbitrarily set to 3:
Simplex(
dataFrame = data_gap1,
lib = lib,
pred = pred[1,],
target = "stream1",
columns = "stream2 stream3",
E = 3,
Tp = 0,
tau = -1,
embedded = FALSE )
I want to find the optimal E with EmbedDimension(), and here is the result with the first data interval:
EmbedDimension(
dataFrame = data_gap1,
lib = lib,
pred = lib[1,],
target = "stream1",
columns = "stream2 stream3",
maxE = 4,
Tp = 0,
tau = -1,
showPlot = TRUE)
Now we can also pass the entire lib, but that does not seem to be the way to go considering the WARNING message: "Disjoint prediction sets are not fully supported. Use with caution":
EmbedDimension(
dataFrame = data_gap1,
lib = lib,
pred = lib,
target = "stream1",
columns = "stream2 stream3",
maxE = 4,
Tp = 0,
tau = -1,
showPlot = TRUE)
Notice that the output is also different from using lib[1,] so it is unclear what is being considered for pred.
I tried removing the rows with NAs entirely and passing all the rows to pred, which produces a different result and no warning:
cond <- apply(data_gap1[,setdiff(colnames(data_gap1),"year")],1,function(r){all(!is.na(r))})
data <- data_gap1[cond,]
EmbedDimension(
dataFrame = data,
lib = c(1,nrow(data)),
pred = c(1,nrow(data)),
target = "stream1",
columns = "stream2 stream3",
maxE = 4,
Tp = 0,
tau = -1,
showPlot = T)
But I was not sure if the column year was used to account for the missing years or not so I shuffled the row, and got a different result:
data_shuffle <- data[sample(x = 1:nrow(data),size = nrow(data),replace = F),]
EmbedDimension(
dataFrame = data_shuffle,
lib = c(1,nrow(data_shuffle)),
pred = c(1,nrow(data_shuffle)),
target = "stream1",
columns = "stream2 stream3",
maxE = 4,
tau = -1,
Tp = 0,
showPlot = T)
Question 1:
This suggests that the year values are not considered and instead that consecutive rows are considered consecutive years? Is that correct?
Question 2:
If that's the case then removing the rows/years with missing data is not the solution, correct?
Question 3:
If that is the case, I see two alternatives to estimate the optimal E in my case:
- estimate
Efor the longest data interval only - estimate
Efor each data interval and then average theEvalues, weighting according to the relative number of data points in each interval
Make sense?
Thank you ![]()