How to incorporate disruption period length as an explanatory variable in linear regression?

I have a time series dataset spanning 72 months with a clear disruption period from month 26 to month 44. I'm analyzing the data by fitting separate linear models for three distinct periods:

Pre-disruption (months 0-25)

During-disruption (months 26-44)

Post-disruption (months 45-71)

For the during-disruption model, I want to include the length of the disruption period as an additional explanatory variable alongside time. I'm analyzing the impact of lockdown measures on nighttime lights, and I want to test whether the duration of the lockdown itself is a significant contributor to the observed changes. In this case, the disruption period length is 19 months (from month 26 to 44), but I have other datasets with different lockdown durations, and I hypothesize that longer lockdowns may have different impacts than shorter ones.

What's the appropriate way to incorporate known disruption duration into the analysis?

A little bit of context:

This is my approach for testing whether lockdown duration contributes to the magnitude of the impact on nighttime lights (column ba in the shared df) during the lockdown period (delimited by knotsNum).

This is how I fitted the linear model for the during-disruption period, without the length of the disruption period as a variable:

# split the series at the disruption knots
pre_data <- df[df$monthNum < knotsNum[1], ]
during_data <- df[df$monthNum >= knotsNum[1] & df$monthNum <= knotsNum[2], ]
post_data <- df[df$monthNum > knotsNum[2], ]

# simple time trend within the disruption period
during_model <- lm(ba ~ monthNum, data = during_data)
summary(during_model)

Here is my dataset:

> dput(df)
structure(list(ba = c(75.5743196350863, 74.6203366002096, 73.6663535653328, 
72.8888364886628, 72.1113194119928, 71.4889580670178, 70.8665967220429, 
70.4616902716411, 70.0567838212394, 70.8242795722238, 71.5917753232083, 
73.2084886381771, 74.825201953146, 76.6378322273966, 78.4504625016473, 
80.4339255221286, 82.4173885426098, 83.1250549660005, 83.8327213893912, 
83.0952494240052, 82.3577774586193, 81.0798739040064, 79.8019703493935, 
78.8698515342936, 77.9377327191937, 77.4299978963597, 76.9222630735257, 
76.7886470146215, 76.6550309557173, 77.4315783782333, 78.2081258007492, 
79.6378781206591, 81.0676304405689, 82.5088809638169, 83.950131487065, 
85.237523842823, 86.5249161985809, 87.8695954274008, 89.2142746562206, 
90.7251944966818, 92.236114337143, 92.9680912967979, 93.7000682564528, 
93.2408108610688, 92.7815534656847, 91.942548368634, 91.1035432715832, 
89.7131675379257, 88.3227918042682, 86.2483383318464, 84.1738848594247, 
82.5152280388184, 80.8565712182122, 80.6045637522384, 80.3525562862646, 
80.5263796870851, 80.7002030879055, 80.4014140664706, 80.1026250450357, 
79.8140166545202, 79.5254082640047, 78.947577740372, 78.3697472167393, 
76.2917760563349, 74.2138048959305, 72.0960610901764, 69.9783172844223, 
67.8099702791755, 65.6416232739287, 63.4170169813438, 61.1924106887589, 
58.9393579024253), monthNum = 0:71), class = "data.frame", row.names = c(NA, 
-72L))

The disruption period:

knotsNum <- c(26,44)
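For reference, the 19-month length mentioned above is the inclusive month count:

lockdown_duration <- knotsNum[2] - knotsNum[1] + 1  # 44 - 26 + 1 = 19 months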

Session info:

> sessionInfo()
R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

time zone:
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.5.1    tools_4.5.1       rstudioapi_0.17.1

I can't speak in detail on this -- it's outside my comfort zone -- but what you are looking at is studied under a variety of names, including "time series intervention analysis (TSIA)", "interrupted time series analysis (ITSA)", "intervention time series analysis (another ITSA)" and maybe a few other aliases. In my very limited experience, you typically look at a time series that starts in steady state and then incurs an interruption/intervention of some sort. In your case, you might need to look at two interruptions (the second being whatever ended the first interruption).

Besides typically assuming some sort of steady state prior to the interruption, another thing of note is that TSIA/ITSA generally involves either incorporating autocorrelation, trend, and (where applicable) seasonality into the model, or at least initial testing to confirm the absence of those factors. This is because, at the risk of oversimplifying things, time series observations are not typically an i.i.d. sample.
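To make that concrete, here is a minimal sketch of the kind of segmented (ITSA-style) model this usually implies, with AR(1) errors via nlme::gls. The dummy and spline coding below is an illustrative assumption, not something from the original post:

library(nlme)

# level-shift dummies and slope-change terms for the two interruptions
df$during   <- as.numeric(df$monthNum >= knotsNum[1] & df$monthNum <= knotsNum[2])
df$post     <- as.numeric(df$monthNum > knotsNum[2])
df$t_during <- pmax(0, df$monthNum - knotsNum[1])  # time since the disruption started
df$t_post   <- pmax(0, df$monthNum - knotsNum[2])  # time since the disruption ended

# baseline trend + level shifts + slope changes, with AR(1) residuals
its_fit <- gls(ba ~ monthNum + during + t_during + post + t_post,
               data = df,
               correlation = corAR1(form = ~ monthNum))
summary(its_fit)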

I'm aware of the methods you mentioned and I tested them (although I didn't show that in my post). The data do "suffer" from autocorrelation (again, based on my own tests, not shown in the post).

I don't know whether it's possible to somehow include the length of the disruption period as a covariate without having to perform ITSA.

For instance, suppose I don't have the other two periods (pre and post) and only have the during-disruption data. How would one incorporate the length of the period into the linear model?

Without knowing more about the variables and your hypotheses, it's hard to give a useful answer. Assuming that the effect of the disruption is not predicated on advanced knowledge of the duration (e.g., you're in a pandemic lockdown and you have no idea how long it will last), you can use time since start of disruption as a predictor. If you think the duration of disruption is known in advance (e.g., start/end dates for summer semester), I would think that each disruption would only produce one observation, and you would need multiple disruptions (e.g., comparing a number of summer semesters of different durations, each yielding one observation) to get anywhere. All of this is obvious, so probably not what you are asking.
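In code terms, the first of those suggestions amounts to something like this (a sketch reusing the OP's during_data and knotsNum):

during_data$t_since_start <- during_data$monthNum - knotsNum[1]  # 0 at the first lockdown month
during_model <- lm(ba ~ t_since_start, data = during_data)
summary(during_model)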

If I'm following correctly what you and @prubin have said, the time since the start of the disruption is exactly collinear with monthNum. That means you can't get separate trends for monthNum and time-since-disruption, because the two are identical up to a constant (knotsNum[1]).
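You can see the aliasing directly in R (an illustrative check on the single series above):

# the second term is just monthNum minus a constant, so its coefficient comes back NA
lm(ba ~ monthNum + I(monthNum - knotsNum[1]), data = during_data)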

Thank you both for the clarification!
@prubin - You're absolutely right that I need multiple disruptions. I do have 58 cities in total with different lockdown durations, so I have the variation needed. Each city has the same time series structure but different lockdown start/end dates.

@startz - You've identified the exact issue I was struggling with! Within a single city, monthNum and lockdown duration are indeed perfectly collinear during the disruption period. I was trying to include lockdown duration as a predictor within each city's during-disruption model, which obviously can't work.

My setup:

  • 58 cities with varying lockdown durations
  • Each city: monthly nighttime lights data (Jan 2018 - Dec 2023)
  • Different lockdown periods per city (some 6 months, others 18+ months)
  • Hypothesis: longer lockdowns → larger drops in nighttime lights

Next steps I'm considering:

  • Pool the during-disruption data from all 58 cities
  • Include both monthNum (time trend) and lockdown_duration (calculated from knotsNum) as predictors (that's my main question after our discussion: how do I do this?)
  • Test if lockdown duration significantly affects the magnitude of impact

Would this approach resolve the collinearity issue? I could share a sample of my CSV files if that would help clarify the data structure.

That should work. I'll be curious to see what you get.

OK, so for the various cities I fitted a linear model for the during-lockdown period using all the available observations and lockdown periods. I did this for the 3 different land use (LU) classes that I have (basically, NTL aggregated at the polygon, i.e., LU, level).

The results are:

  1. Industrial
Call:
lm(formula = trend ~ monthNum + lockdown_duration, data = all_during_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-54.351 -30.996  -8.828  25.195 141.459 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        66.2404     4.0716  16.269   <2e-16 ***
monthNum           -0.2024     0.1205  -1.680   0.0931 .  
lockdown_duration  -0.9974     0.1110  -8.986   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 34.26 on 1699 degrees of freedom
Multiple R-squared:  0.06502,	Adjusted R-squared:  0.06392 
F-statistic: 59.08 on 2 and 1699 DF,  p-value: < 2.2e-16

=== DATA SUMMARY ===
Total observations: 1702 
Cities included: 52 
Lockdown duration range: 2 - 36 months
  2. Retail
Call:
lm(formula = trend ~ monthNum + lockdown_duration, data = all_during_data)

Residuals:
   Min     1Q Median     3Q    Max 
-51.09 -30.93 -15.86  27.81  96.42 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        62.2287     6.0925  10.214  < 2e-16 ***
monthNum           -0.1841     0.1832  -1.005    0.315    
lockdown_duration  -0.8575     0.1694  -5.062  5.1e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 35.8 on 849 degrees of freedom
Multiple R-squared:  0.04344,	Adjusted R-squared:  0.04119 
F-statistic: 19.28 on 2 and 849 DF,  p-value: 6.477e-09

=== DATA SUMMARY ===
Total observations: 852 
Cities included: 41 
Lockdown duration range: 2 - 36 months
  3. Airport
Call:
lm(formula = trend ~ monthNum + lockdown_duration, data = all_during_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.873 -29.924  -3.642  25.565  93.385 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       48.06517    7.47486   6.430 3.23e-10 ***
monthNum          -0.32711    0.22650  -1.444    0.149    
lockdown_duration  0.07157    0.20690   0.346    0.730    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 33.84 on 454 degrees of freedom
Multiple R-squared:  0.004886,	Adjusted R-squared:  0.0005022 
F-statistic: 1.115 on 2 and 454 DF,  p-value: 0.329

=== DATA SUMMARY ===
Total observations: 457 
Cities included: 28 
Lockdown duration range: 2 - 36 months

It seems that the length of the lockdown period is significant for 2 of the 3 LU classes. On the other hand, the R-squared is very low.

The code I used (only for the airport data, as an example):

library(dplyr)

# function to convert date to monthNum (0 = Jan 2018, 71 = Dec 2023), see below
date_to_monthNum <- function(date_string) {
  if (is.na(date_string)) return(NA)
  date_obj <- as.Date(date_string, format = "%d-%m-%y")
  
  # calculate months difference from Jan 2018
  months_diff <- (as.numeric(format(date_obj, "%Y")) - 2018) * 12 + 
    (as.numeric(format(date_obj, "%m")) - 1)
  
  return(months_diff)
}
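
# quick sanity check (added example, assuming lockdown dates are "%d-%m-%y" strings):
# date_to_monthNum("15-03-20")  # (2020 - 2018) * 12 + (3 - 1) = 26, i.e. March 2020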

# df containing lockdown dates and converts them to monthNum
lockdown_df <- read.csv("path/lockdown_df.csv", stringsAsFactors = FALSE)
lockdown_df <- lockdown_df %>%
  mutate(
    start_monthNum = sapply(start_date, date_to_monthNum),
    end_monthNum = sapply(end_date, date_to_monthNum),
    lockdown_duration = end_monthNum - start_monthNum + 1
  )

# main folder path, i.e., where the actual NTL data are
main_folder <- "path"

# initialize pooled data
all_during_data <- data.frame()
skipped_cities <- character()

# get all city folders
city_folders <- list.dirs(main_folder, recursive = FALSE, full.names = FALSE)

# process each city
for (city in city_folders) {
  # get lockdown info for this city
  city_lockdown <- lockdown_df[lockdown_df$csv_name == city, ]
  
  # skip cities with no lockdown data
  if (nrow(city_lockdown) == 0 || is.na(city_lockdown$start_monthNum[1]) || is.na(city_lockdown$end_monthNum[1])) {
    skipped_cities <- c(skipped_cities, city)
    next
  }
  
  start_month <- city_lockdown$start_monthNum[1]
  end_month <- city_lockdown$end_monthNum[1]
  duration <- city_lockdown$lockdown_duration[1]
  
  # path to airport LU
  airport_path <- file.path(main_folder, city, "lu", "airport", "dissolved.csv") # airport, retail, industrial
  
  if (!file.exists(airport_path)) {
    skipped_cities <- c(skipped_cities, paste(city, "(no airport data)"))
    next
  }
  
  # read airport data
  airport_data <- read.csv(airport_path, stringsAsFactors = FALSE)
  
  # add monthNum column
  airport_data <- airport_data %>%
    mutate(monthNum = 0:(nrow(airport_data)-1))
  
  # filter during-lockdown period
  during_data <- airport_data %>%
    filter(monthNum >= start_month & monthNum <= end_month) %>%
    mutate(
      city = city,
      lockdown_duration = duration
    )
  
  # add to pooled data
  all_during_data <- rbind(all_during_data, during_data)
}

# lm fit pooled model
if (nrow(all_during_data) > 0) {
  model <- lm(trend ~ monthNum + lockdown_duration, data = all_during_data)
  
  cat("=== POOLED AIRPORT MODEL (DURING LOCKDOWN) ===\n")
  print(summary(model))
  
  cat("\n=== DATA SUMMARY ===\n")
  cat("Total observations:", nrow(all_during_data), "\n")
  cat("Cities included:", length(unique(all_during_data$city)), "\n")
  cat("Lockdown duration range:", min(all_during_data$lockdown_duration), "-", max(all_during_data$lockdown_duration), "months\n")
} else {
  cat("No data available for analysis\n")
}

What do you think?

Is trend basically what you called ba earlier?

Yes, trend represents the NTL (what I called ba earlier). Sorry for not saying so earlier.

Without knowing your definition of (and measure for) "nighttime lights" it's hard to say, but I have a strong suspicion that there is a seasonal component to "nighttime lights" (particularly since the length of the night varies with the calendar).

The NTL is the average value of all pixels within the LU (the airport, for example), so I have 72 monthly values. I then performed STL (seasonal-trend decomposition using LOESS) to extract the trend component, and that's what you see in the code (the trend column). So the seasonality has been removed.
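For reference, a minimal sketch of that extraction step (ntl_monthly is a hypothetical vector holding the 72 raw monthly means for one LU polygon):

ntl_ts  <- ts(ntl_monthly, start = c(2018, 1), frequency = 12)  # monthly series, Jan 2018 onward
stl_fit <- stl(ntl_ts, s.window = "periodic")
trend   <- as.numeric(stl_fit$time.series[, "trend"])  # the trend column used in the models above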

It looks like duration matters for everything other than airports. While I don't know the subject matter, I wouldn't be surprised if airports left the lights on while other places did not.

One thing that you might want to do is to allow a separate intercept for each city, i.e. add fixed effects. You can do this by coding the city as a factor.

You mean something like model <- lm(trend ~ monthNum + lockdown_duration + factor(city), data = all_during_data)?

Exactly.
