I have looked through the recipes package for pre-processing methods used for spectral data but, with a few exceptions, I cannot find any of the most commonly used types of pre-processing in spectroscopy. I'm looking for pre-processing methods such as:
Standard Normal Variate (SNV)
Multiplicative Scatter Correction (MSC)
Normalisation of spectra (area, length, sum)
Baseline corrections (Automatic Weighted Least Squares, Automatic Whittaker Filter, Asymmetric Least Squares)
Savitzky-Golay filter (sliding window + derivative)
Am I missing the functions, or are they missing from recipes?
Assuming they are missing, is anyone aware of a package that has these pre-processing methods and also integrates well with tidymodels?
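For readers unfamiliar with the first two methods in the list, here is a minimal base R sketch of what SNV and MSC compute. The function names `snv()` and `msc()` and the toy data are hypothetical and are not part of recipes or any other package; the sketch assumes spectra are stored one per row.

```r
# Spectra are stored one per row: rows = samples, columns = wavelengths.

# Standard Normal Variate: centre and scale each spectrum by its own
# mean and standard deviation.
snv <- function(spectra) {
  spectra <- as.matrix(spectra)
  centred <- sweep(spectra, 1, rowMeans(spectra), "-")
  sweep(centred, 1, apply(spectra, 1, sd), "/")
}

# Multiplicative Scatter Correction: regress each spectrum on a reference
# (by default the mean spectrum), then remove the fitted offset and slope.
msc <- function(spectra, reference = colMeans(spectra)) {
  spectra <- as.matrix(spectra)
  corrected <- t(apply(spectra, 1, function(s) {
    coefs <- unname(lm.fit(cbind(1, reference), s)$coefficients)
    (s - coefs[1]) / coefs[2]
  }))
  dimnames(corrected) <- dimnames(spectra)
  corrected
}

# Toy data: one underlying shape with a different offset and gain per sample
set.seed(1)
shape <- sin(seq(0, pi, length.out = 50))
toy <- t(sapply(1:5, function(i) runif(1, 0, 2) + runif(1, 0.5, 2) * shape))

toy_snv <- snv(toy)
toy_msc <- msc(toy)
```

After SNV every row has mean 0 and standard deviation 1. Because the toy spectra differ only by offset and gain, MSC maps every row onto the mean spectrum.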
James Wade and I have been (intermittently) working on a package to do this called measure. It's still in development and the API will change a little, but it does work and we have implemented Savitzky-Golay. I'm using it for publications, so it will be a serious thing; it's just low on the priority list at this particular moment.
There are input steps that arrange the spectra in an internal tidy format and steps that can make them available to the model in either a wide or long format.
Here's a small example that can go from wide to wide with SG in between:
library(tidymodels)
library(measure)

# `meat_train` (training data) and `meat_rs` (resamples) are assumed to
# have been created already; their setup is not shown here.

# Make the recipe
sg_rec <-
  recipe(water ~ ., data = meat_train) %>%
  # Make sure that the sample indicator is not treated as a predictor
  update_role(.row, new_role = "id") %>%
  # Since the spectra are currently "wide" use this step to convert to
  # the internal format
  step_measure_input_wide(starts_with("x_")) %>%
  # Preprocess the spectra. These arguments can be tuned
  step_measure_savitzky_golay(
    differentiation_order = 1,
    degree = 3,
    # "window side" is how many points to each side. The window
    # size is 2 * window_side + 1
    window_side = 5
  ) %>%
  # Put them back into a wide format. The default prefix for the
  # variables is "measure_"
  step_measure_output_wide() %>%
  # Now do other recipe stuff (if needed) to the predictor columns
  step_normalize(starts_with("measure_"))

sg_wflow <- workflow(sg_rec, linear_reg())

# Do tidymodels stuff as usual
sg_res <- sg_wflow %>% fit_resamples(meat_rs)
collect_metrics(sg_res)
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1 rmse    standard   4.03     10  1.26   Preprocessor1_Model1
#> 2 rsq     standard   0.843    10  0.0780 Preprocessor1_Model1
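To make the three arguments in the example concrete, here is a rough base R sketch of a Savitzky-Golay filter. It is not how measure implements the step; `sg_coefs()` and `sg_filter()` are hypothetical helpers, and the sketch assumes evenly spaced points with unit spacing, so derivatives are per index step.

```r
# Savitzky-Golay coefficients for a window of 2 * window_side + 1 points.
sg_coefs <- function(window_side, degree, differentiation_order = 0) {
  offsets <- -window_side:window_side
  stopifnot(degree < length(offsets), differentiation_order <= degree)
  # Design matrix with columns offsets^0, offsets^1, ..., offsets^degree
  design <- outer(offsets, 0:degree, `^`)
  # Row k + 1 of the least-squares pseudo-inverse yields the k-th
  # coefficient of the polynomial fitted to the window
  pinv <- solve(crossprod(design), t(design))
  # The k-th derivative at the window centre is k! times that coefficient
  factorial(differentiation_order) * pinv[differentiation_order + 1, ]
}

sg_filter <- function(x, window_side, degree, differentiation_order = 0) {
  w <- sg_coefs(window_side, degree, differentiation_order)
  # stats::filter() applies its weights in reverse order, hence rev().
  # Points within window_side of either end come back as NA.
  as.numeric(stats::filter(x, rev(w), sides = 2))
}

# A quadratic signal: its first derivative is 2 * i at index i
signal <- (1:30)^2
first_deriv <- sg_filter(signal, window_side = 5, degree = 3,
                         differentiation_order = 1)
smoothed <- sg_filter(signal, window_side = 5, degree = 3)
```

With these settings the window holds 11 points, a cubic is fitted to each window by least squares, and the fitted polynomial (or its derivative) is evaluated at the centre point. A polynomial input of degree 3 or lower is therefore reproduced exactly away from the ends.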
We are going to refine the syntax for the steps, add more processing steps, and do some work with tuning parameters. For the parameters, there are constraints with SG, such as the degree needing to be <= the window size. I'll have a dials PR soon to broadly incorporate these types of constraints into the system.
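Until that is in place, one way to respect such constraints is to build the candidate grid by hand and drop the invalid rows. The sketch below uses plain `expand.grid()` rather than dials, and the object names are hypothetical. It assumes the strict form of the rule (degree < window size), since a least-squares polynomial of degree p needs at least p + 1 points, and it also assumes the differentiation order cannot exceed the degree; neither is taken from measure's documentation.

```r
# Candidate values for the Savitzky-Golay arguments
sg_grid <- expand.grid(
  window_side = 1:5,
  degree = 1:6,
  differentiation_order = 0:2
)

# The window size is 2 * window_side + 1
sg_grid$window_size <- 2 * sg_grid$window_side + 1

# Keep only combinations that satisfy the assumed constraints
is_valid <- sg_grid$degree < sg_grid$window_size &
  sg_grid$differentiation_order <= sg_grid$degree
sg_grid_valid <- sg_grid[is_valid, ]

nrow(sg_grid)
nrow(sg_grid_valid)
```

The filtered data frame could then be passed as a regular grid once the step arguments are marked for tuning.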
Please put in issues for anything that you are interested in and, if possible, contribute code!
An R package would definitely be easier, but I am currently just using {reticulate} to pass data to scipy.signal and pybaselines for spectral pre-processing. I'm doing this on a smaller scale (just in notebooks and Shiny apps), so it might not be as robust a solution as you're looking for, but I will say it is working reliably. I'm using Poetry and {renv} to set up the Python and R environments reproducibly.
I'll raise the issue and see if I can contribute something useful.
As a reference for other people, the mdatools package implements many of the above-mentioned pre-processing steps, as well as models specific to spectral analysis/chemometrics. Unfortunately, the library is not directly compatible with the tidymodels framework.