Pre-processing for spectral data

Hi

I have looked through the recipes package for pre-processing methods for spectral data, but (with a few exceptions) I cannot find the most commonly used types of pre-processing in spectroscopy. I'm looking for pre-processing methods such as:

Standard Normal Variate (SNV)
Multiplicative Scatter Correction (MSC)
Normalisation of spectra (area, length, sum)
Baseline corrections (Automatic Weighted Least Squares, Automatic Whittaker Filter, Asymmetric Least Squares)
Savitzky-Golay filter (sliding window + derivative)

Am I missing the functions or are they missing from recipes?
Assuming they are missing, is anyone aware of a package that has these pre-processing methods and also integrates well with tidymodels?

BR

James Wade and I have been (intermittently) working on a package to do this called measure. It's still in development and the API will change a little but it does work and we have implemented Savitzky-Golay. I'm using it for publications so it will be a serious thing, just low on the priority list at this particular moment.

There are input steps that arrange the spectra in an internal tidy format and steps that can make them available to the model in either a wide or long format.

Here's a small example that can go from wide to wide with SG in between:

library(tidymodels)
# pak::pak(c("JamesHWade/measure"), ask = FALSE)
library(measure)
#> Registered S3 method overwritten by 'measure':
#>   method                    from   
#>   required_pkgs.step_isomap recipes

data(meats, package = "modeldata")

meats <- 
  meats %>% 
  # Only predict one outcome
  select(-fat, -protein) %>% 
  # Add a sample identifier called ".row"
  add_rowindex() %>% 
  relocate(water, .row)

set.seed(1)
meat_split <- initial_split(meats, strata = water)
meat_train <- training(meat_split)
meat_test  <- testing(meat_split)
meat_rs <- vfold_cv(meat_train)
# Make the recipe

sg_rec <-
  recipe(water ~ ., data = meat_train) %>%
  # Make sure that the sample indicator is not treated as a predictor
  update_role(.row, new_role = "id") %>%
  # Since the spectra are currently "wide" use this step to convert to
  # the internal format
  step_measure_input_wide(starts_with("x_")) %>%
  # Preprocess the spectra. These arguments can be tuned
  step_measure_savitzky_golay(
    differentiation_order = 1,
    degree = 3,
    # "window side" is how many points to each side. The window
    # size is 2 * window_side + 1
    window_side = 5
  ) %>% 
  # Put them back into a wide format. The default prefix for the
  # variables is "measure_"
  step_measure_output_wide() %>% 
  # Now do other recipe stuff (if needed) to the predictor columns
  step_normalize(starts_with("measure_"))

sg_wflow <- workflow(sg_rec, linear_reg())

# Do tidymodel stuff as usual

sg_res <- sg_wflow %>% fit_resamples(meat_rs)
collect_metrics(sg_res)
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   4.03     10  1.26   Preprocessor1_Model1
#> 2 rsq     standard   0.843    10  0.0780 Preprocessor1_Model1

Created on 2024-11-07 with reprex v2.1.0

We are going to refine the syntax for the steps, add more processing steps, and do some work with tuning parameters. For the parameters, there are constraints with SG, such as the degree needing to be less than the window size. I'll have a dials PR soon to broadly incorporate these types of constraints into the system.
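As a rough illustration of the kind of constraint involved, here is a minimal base R sketch that filters a candidate grid of SG settings. `sg_params_ok()` is a hypothetical helper, not a function from measure or dials, and the rules it encodes are assumptions: the polynomial degree must be strictly below the window size (2 * window_side + 1), and a derivative order above the degree is treated as not useful because it would differentiate the fitted polynomial down to zero.

```r
# Hypothetical helper (not part of measure or dials): returns TRUE for
# Savitzky-Golay settings that satisfy the assumed constraints.
sg_params_ok <- function(window_side, degree, differentiation_order = 0) {
  window_size <- 2 * window_side + 1
  degree < window_size & differentiation_order <= degree
}

# Build a candidate grid and keep only the usable combinations
sg_grid <- expand.grid(
  window_side = 1:5,
  degree = 1:4,
  differentiation_order = 0:2
)
keep <- with(sg_grid, sg_params_ok(window_side, degree, differentiation_order))
sg_grid <- sg_grid[keep, ]
```

A filtered data frame like this could, in principle, be passed as a manual grid when tuning, until the constraints are handled by the tuning tools themselves.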

Please put in issues for anything that you are interested in and, if possible, contribute code!

An R package would definitely be easier, but I am currently just using {reticulate} to pass data to scipy.signal and pybaselines for spectral pre-processing. I'm doing this on a smaller scale (just in notebooks and Shiny apps), so it might not be as robust a solution as you're looking for, but I will say it is working reliably. I'm using Poetry and {renv} to reproducibly set up the Python and R environments.
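A minimal sketch of what that kind of call can look like, assuming a Python environment with scipy is already visible to {reticulate}. The toy matrix and object names are made up for illustration, and the filter settings are arbitrary:

```r
library(reticulate)

# Assumes scipy is installed in the Python environment reticulate is using
scipy_signal <- import("scipy.signal")

# Toy data: 5 "spectra" (rows) measured at 50 points (columns)
set.seed(1)
spectra <- matrix(rnorm(5 * 50), nrow = 5)

# First-derivative Savitzky-Golay filter along each row (axis = 1).
# Integer arguments need the L suffix so Python receives ints, not floats.
sg_deriv <- scipy_signal$savgol_filter(
  spectra,
  window_length = 11L,
  polyorder = 3L,
  deriv = 1L,
  axis = 1L
)
dim(sg_deriv)
```

The result comes back as an R matrix with the same shape as the input, so it can be bound back onto the outcome column before modelling.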

Thanks for the replies! Really appreciate it.

I'll raise the issue and see if I can contribute something useful.

For other people's reference, the mdatools package implements many of the above-mentioned pre-processing steps, as well as models specific to spectral analysis/chemometrics. Unfortunately, the package is not directly compatible with the tidymodels framework.
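For the simplest method on the list, SNV, a stopgap in base R is short enough to write by hand. This is only a sketch under the usual definition (centre each spectrum by its own mean and divide by its own standard deviation); `snv()` is a hypothetical helper and the toy matrix stands in for real spectra:

```r
# Hypothetical helper: Standard Normal Variate, applied row-wise.
# Each row is one spectrum; each column is one wavelength/channel.
snv <- function(spectra) {
  spectra <- as.matrix(spectra)
  row_means <- rowMeans(spectra)
  row_sds <- apply(spectra, 1, sd)
  # Subtract each row's mean, then divide by each row's standard deviation
  centered <- sweep(spectra, 1, row_means, "-")
  sweep(centered, 1, row_sds, "/")
}

# Toy data: 3 spectra with 10 channels each
set.seed(1)
toy <- matrix(rnorm(3 * 10, mean = 3), nrow = 3)
toy_snv <- snv(toy)
```

Because SNV only uses values from within each spectrum, applying it to the wide spectral columns before resampling should not leak information between samples.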