From Sequence Counts to Variant Forecasts: A lineagefreq Tutorial

When a new pathogen variant emerges, public health teams need to estimate its growth advantage and forecast when it will dominate. The lineagefreq R package provides a reproducible workflow for these tasks using genomic surveillance count data.

Installation

install.packages("lineagefreq")

Working with Real CDC Data

The package ships with real CDC surveillance data. Here is an analysis of the JN.1 emergence in late 2023:

library(lineagefreq)
data(cdc_sarscov2_jn1)

x <- lfq_data(cdc_sarscov2_jn1, lineage = lineage, date = date, count = count)

Fitting a Model

fit <- fit_model(x, engine = "mlr")

This fits a multinomial logistic regression: each lineage gets an intercept and a growth rate, estimated by maximum likelihood.
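For readers who want the model written out, here is a sketch of the standard multinomial logistic regression for lineage frequencies. The symbols are introduced here for illustration, and the choice of lineage 1 as the reference (with its parameters fixed at zero for identifiability) is an assumption. The package's exact parameterisation may differ.

```latex
% K lineages; lineage 1 is the reference, so \alpha_1 = \beta_1 = 0.
% p_i(t): expected frequency of lineage i at time t
% \alpha_i: intercept, \beta_i: growth rate relative to the reference
\[
  p_i(t) = \frac{\exp(\alpha_i + \beta_i t)}{\sum_{j=1}^{K} \exp(\alpha_j + \beta_j t)},
  \qquad \alpha_1 = \beta_1 = 0
\]
% Observed counts y_i(t) out of N(t) sequences at time t
\[
  \bigl(y_1(t), \dots, y_K(t)\bigr) \sim \mathrm{Multinomial}\bigl(N(t),\, p_1(t), \dots, p_K(t)\bigr)
\]
% Equivalent log-odds form: linear in time
\[
  \log \frac{p_i(t)}{p_1(t)} = \alpha_i + \beta_i t
\]
```

Under this form, each growth rate is the slope of a lineage's log-odds against the reference, which is why it reads naturally as a growth advantage.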

Growth Advantages

growth_advantage(fit)
growth_advantage(fit, type = "relative_Rt", generation_time = 5)
growth_advantage(fit, type = "doubling_time")

A relative Rt of 1.3 means the lineage generates about 30% more secondary infections per generation than the reference lineage.
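To make the arithmetic concrete, here is a minimal base R sketch of one common way to turn a growth-rate advantage into a relative Rt and a doubling time. It assumes a fixed generation time, and the value of delta is made up for illustration. Whether growth_advantage() uses exactly this conversion internally is an assumption.

```r
# Hypothetical inputs for illustration; not estimates from the CDC data.
delta <- 0.05           # growth-rate advantage per day over the reference lineage
generation_time <- 5    # assumed generation time in days, as in the call above

# Assumption: fixed generation time, so relative Rt = exp(delta * generation_time).
relative_rt <- exp(delta * generation_time)

# Time for the lineage's odds against the reference to double.
doubling_time <- log(2) / delta

round(relative_rt, 2)    # 1.28
round(doubling_time, 1)  # 13.9
```

With these made-up numbers, a 5% daily advantage corresponds to roughly 28% more transmission per generation.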

Forecasting with Uncertainty

fc <- forecast(fit, horizon = 28, n_sim = 1000)
autoplot(fc)

Honest Forecast Evaluation

bt <- backtest(x, engines = c("mlr", "piantham"),
               horizons = c(7, 14, 21, 28), min_train = 42)
sc <- score_forecasts(bt, metrics = c("mae", "coverage", "wis"))
compare_models(sc)

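Rolling-origin backtesting refits each model using only the data available up to a forecast origin, then scores the forecast against what was observed afterwards. This avoids the common mistake of reporting in-sample fit as forecast accuracy.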
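The idea behind rolling-origin evaluation can be sketched in a few lines of base R. This is not how backtest() is implemented. The helper rolling_origin_splits() and its step argument are hypothetical, the dates are made up, and min_train and horizon are assumed to be measured in days.

```r
# Hypothetical helper: enumerate rolling-origin splits over a set of dates.
# Each split trains on dates up to the origin and targets origin + horizon.
rolling_origin_splits <- function(dates, min_train = 42, horizon = 14, step = 7) {
  dates <- sort(unique(as.Date(dates)))
  first_origin <- min(dates) + min_train   # earliest origin with enough history
  last_origin <- max(dates) - horizon      # latest origin with an observed target
  if (first_origin > last_origin) {
    return(list())
  }
  origins <- seq(first_origin, last_origin, by = step)
  lapply(seq_along(origins), function(i) {
    origin <- origins[i]
    list(
      origin = origin,
      train = dates[dates <= origin],      # only data available at the origin
      target = origin + horizon            # date the forecast is scored against
    )
  })
}

# Made-up weekly collection dates, for illustration only.
dates <- seq(as.Date("2023-10-07"), by = "week", length.out = 16)
splits <- rolling_origin_splits(dates)

length(splits)       # 8
splits[[1]]$origin   # "2023-11-18"
splits[[1]]$target   # "2023-12-02"
```

Every target date falls strictly after the training window, so no split can be scored on data the model has already seen.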

Key Features

  • Five engines: MLR, hierarchical MLR, Piantham Rt conversion, and two Bayesian engines via Stan
  • Built-in backtesting: rolling-origin out-of-sample evaluation
  • Real data included: two CDC SARS-CoV-2 datasets for immediate validation
  • Broom integration: tidy(), glance(), and augment() work as expected
  • Surveillance tools: sequencing_power() and summarize_emerging() for programme planning

What It Is Not

lineagefreq is not a replacement for specialised phylodynamic tools (BEAST, Nextstrain's evofr). It is a lighter-weight, CRAN-distributed alternative for teams that need reproducible frequency analysis without setting up Stan infrastructure. The Bayesian engines remain available if cmdstanr is installed.

Happy to discuss methodology or answer questions.