From Sequence Counts to Variant Forecasts: A lineagefreq Tutorial

CuiweiG · April 12, 2026, 1:18pm

When a new pathogen variant emerges, public health teams need to estimate its growth advantage and forecast when it will dominate. The lineagefreq R package provides a reproducible workflow for these tasks using genomic surveillance count data.

Installation

install.packages("lineagefreq")

Working with Real CDC Data

The package ships with real CDC surveillance data. Here is an analysis of the JN.1 emergence in late 2023:

library(lineagefreq)
data(cdc_sarscov2_jn1)

x <- lfq_data(cdc_sarscov2_jn1, lineage = lineage, date = date, count = count)

Fitting a Model

fit <- fit_model(x, engine = "mlr")

This fits a multinomial logistic regression to the lineage counts: each lineage gets an intercept and a growth rate on the log-odds scale, measured relative to a reference lineage and estimated by maximum likelihood.
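To make the idea concrete, here is a minimal base-R sketch of the two-lineage case, where multinomial logistic regression reduces to ordinary binomial logistic regression. It does not use lineagefreq at all: the data are simulated, and the names `variant`, `other`, and `fit_glm` are made up for illustration.

```r
# Simulated daily counts for one variant vs. everything else (hypothetical data).
set.seed(1)
t <- 0:59                               # days since first sample
true_p <- plogis(-4 + 0.08 * t)         # true variant frequency over time
n <- rep(200, length(t))                # sequences collected per day
variant <- rbinom(length(t), n, true_p)
other <- n - variant

# With two lineages, multinomial logistic regression is the same model as
# binomial logistic regression on (variant, other) counts.
fit_glm <- glm(cbind(variant, other) ~ t, family = binomial())

# Intercept and per-day growth rate, both on the log-odds scale.
coef(fit_glm)
```

The coefficient on `t` is the per-day growth advantage of the variant over the reference group; with more than two lineages the same structure is repeated for each non-reference lineage.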

Growth Advantages

growth_advantage(fit)
growth_advantage(fit, type = "relative_Rt", generation_time = 5)
growth_advantage(fit, type = "doubling_time")

A relative Rt of 1.3 means the variant causes about 30% more secondary infections per generation than the reference lineage.
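For intuition, a per-day growth advantage can be turned into these quantities by hand. The sketch below assumes a fixed generation time in days and uses the common approximation relative Rt = exp(growth advantage × generation time); whether the package uses exactly this conversion internally is an assumption here, and the value of `delta` is invented for the example.

```r
# Hypothetical inputs: a per-day growth advantage on the log-odds scale
# and a fixed generation time in days.
delta <- 0.0525          # growth advantage per day (assumed value)
generation_time <- 5     # days, matching the call above

# Fixed-generation-time approximation for the ratio of reproduction numbers.
relative_rt <- exp(delta * generation_time)

# Time for the variant's odds against the reference to double.
doubling_time <- log(2) / delta

round(relative_rt, 2)    # 1.3
round(doubling_time, 1)  # 13.2
```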

Forecasting with Uncertainty

fc <- forecast(fit, horizon = 28, n_sim = 1000)
autoplot(fc)

Honest Forecast Evaluation

bt <- backtest(x, engines = c("mlr", "piantham"),
               horizons = c(7, 14, 21, 28), min_train = 42)
sc <- score_forecasts(bt, metrics = c("mae", "coverage", "wis"))
compare_models(sc)

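Rolling-origin backtesting refits each model on the data available up to a cutoff date, forecasts beyond it, then moves the cutoff forward and repeats. This avoids the common mistake of reporting in-sample fit as forecast accuracy.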
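The splitting logic behind a rolling origin can be sketched in base R. This is an illustration of the general technique, not the package's implementation: `rolling_origins()` is a hypothetical helper, and the weekly `step` between origins is an assumption.

```r
# Hypothetical helper: forecast origins for a rolling-origin evaluation.
rolling_origins <- function(dates, min_train = 42, horizon = 28, step = 7) {
  dates <- sort(unique(as.Date(dates)))
  first <- min(dates) + min_train     # earliest origin with enough training days
  last <- max(dates) - horizon        # latest origin with a full test horizon
  if (first > last) return(as.Date(character(0)))
  seq(first, last, by = step)
}

dates <- seq(as.Date("2023-10-01"), by = "day", length.out = 120)
origins <- rolling_origins(dates)

# For each origin, train only on earlier dates and score only later ones.
splits <- lapply(seq_along(origins), function(i) {
  o <- origins[i]
  list(origin = o,
       train = dates[dates < o],
       test = dates[dates >= o & dates < o + 28])
})

length(splits)  # 8
```

No test date ever appears in the corresponding training set, which is what makes the resulting scores out-of-sample.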

Key Features

Five engines — MLR, hierarchical MLR, Piantham Rt conversion, and two Bayesian engines via Stan
Built-in backtesting — rolling-origin out-of-sample evaluation
Real data included — two CDC SARS-CoV-2 datasets for immediate validation
Broom integration — tidy(), glance(), augment() work as expected
Surveillance tools — sequencing_power() and summarize_emerging() for programme planning

What It Is Not

It is not a replacement for specialised phylodynamic tools (BEAST, Nextstrain's evofr). It is a lighter-weight, CRAN-distributed alternative for teams that need reproducible frequency analysis without setting up Stan infrastructure (though Bayesian engines are available if cmdstanr is installed).

Links

Happy to discuss methodology or answer questions.
