Converting a continuous variable to a discrete value for regression

Hi there

I have a manufacturing process in which it takes 20 days to make a product. During this time, many variables (predictors) can affect the product quality, for example temperature or the level of gases in the pumped air. At the end of the process, a sample is taken to check the quality (response). Because there is only one response for many values of each predictor, I have tried using the area under the curve (AUC) to collapse each continuous variable into a single number as an input for machine learning. But this approach is questionable, because a taller, narrower curve could have a similar area to a shorter, wider curve, correct?
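
For what it's worth, here is a quick numeric illustration of that concern: two very different profiles (a tall, narrow Gaussian bump versus a short, wide one) whose trapezoidal AUCs over the 20 days come out essentially equal, so the AUC alone cannot distinguish them. The curves are made up purely for illustration.

```r
t  <- seq(0, 20, by = 0.1)
f1 <- dnorm(t, mean = 10, sd = 1) # tall and narrow
f2 <- dnorm(t, mean = 10, sd = 4) # short and wide

# trapezoidal rule for the area under (x, y) samples
trap <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
c(auc1 = trap(t, f1), auc2 = trap(t, f2)) # both approximately 1
```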

Are there any other options?

Thank you and regards.

It seems to me that you have time series data, and you want to predict a single response (quality). What is quality? Is it a quantitative variable (a real or integer number) or is it a qualitative variable ("reject", "rework", "accept")? What is the machine learning algorithm you're talking about, and what do you want to do with it? Do you want to predict the quality of a new product, given the time series of the predictors recorded during the 20 days? Please provide a reproducible example. That will go a long way towards helping us to help you!

PS: I'm curious, what kind of product takes exactly 20 days to manufacture? I work in heavy industry, where lead times are much longer, so I guess it's something smaller and/or simpler than heavy machinery.

PS2: You're most likely right that the AUC approach is the wrong one, but we need a reprex to tell for sure.

Hi Andrea

Sincere apologies for only replying now. The quality attribute is a continuous, strictly positive number, so I would like to predict its value with a confidence interval at the end of the process (a regression problem), given the set of predictor time series plus some discrete predictors.

This paper (http://www.adchem2018.org/USB/media/files/0088.pdf) describes what I am thinking of doing. Note that some variables are measured continuously every 30 seconds, while others are measured daily or only once at the end of the process.

No need to apologize, this isn't work (for me, at least), so there are no deadlines :slight_smile: I still have to reply to a post from before Christmas!

Ok, I see. My two cents:

  1. I wouldn't go for a machine learning approach here. Except for Gaussian Processes & PLSR (which are the most "statistical" of the approaches shown in that paper :slight_smile: ), it's not straightforward to get confidence intervals out of any of the other methods. And I'm not talking about the fact that they are nonlinear, nonparametric or even non-statistical methods (SVMs are not probabilistic classifiers: they don't compute a class-conditional probability). You could use the bootstrap (or take a Bayesian approach) to get some kind of CI (a credible interval, if using Bayesian inference) for all of these methods. The point is that for nonparametric (distribution-free) approaches, it's not so immediate to estimate the uncertainty when you have both cross-sectional predictors (e.g., the sex of an individual) and time-varying predictors (e.g., blood pressure, body temperature, metabolic rate, heart rate, etc.). I'm not saying it's impossible, but it definitely isn't straightforward. Instead, there are principled ways to compute a CI on your output if you use more classical statistical approaches. In particular, GAMs (see the mgcv package) or scalar-on-function regression (for a simple introduction and some references, see the Cross Validated question "Estimating the effect of different histories of exposure on a scalar response measured at the end of a study") can accommodate both cross-sectional and time-varying predictors without breaking a sweat. Different sampling rates for different predictors are a bit of a pain, but you can still accommodate them. I would definitely use GAMs or scalar-on-function regression first, to develop a strong baseline (a minimal mgcv sketch follows this list).

  2. After this, if you're still unsatisfied with the results, you may want to turn to Machine Learning. Be sure to keep a test set and don't use it to perform model selection: just use it at the end, to compare the ML model against the "classical" one. For me, ML mostly means "a distribution-free approach with a focus on predictive accuracy". Of all the ML approaches which can accommodate cross-sectional & sequence data at the same time, by far the best, IMO, is the combination of a large enough CNN/MLP and an LSTM (see the keras sketch after this list). This will work, but:

    • it won't be straightforward to implement if you don't have prior experience with Deep Learning. Realistically, set aside two man-months for this.
    • you will need enough data if you want to trounce the "classical" approach. Also, don't even think of training this baby without a GPU: if you aren't familiar with GPU training, forget it. Even if you offload most of the annoyance (i.e., installation and dependency handling of the GPU libraries) to a cloud service provider, you will still have to get acquainted with some of the nasty bits of training modern NN models on GPUs, and it will take time. Maybe try the approach shown in your paper.
    • the nasty part now is getting the CI. The bag of little bootstraps is probably the safest approach: it works for cross-sectional data or time series data, but with predictors of both types I'm not sure if/how you can use it, so set aside some more time to investigate this (a rough sketch follows this list). Maybe you could try dropping a line to the authors of the BLB paper. The other approach is Bayesian Deep Learning; however, the most commonly used BDL method for estimating (credible) intervals is Monte Carlo dropout, and the accuracy of dropout as an approximation to the exact Bayesian posterior has been called into question by DeepMind researchers.
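
To make item 1 concrete, here is a minimal sketch of scalar-on-function regression via mgcv's linear functional terms (see ?linear.functional.terms). All of the data below are simulated placeholders, not your process: the shapes (100 runs, 40 sampling times over 20 days) and the single curve-valued predictor are assumptions purely for illustration.

```r
library(mgcv)

set.seed(1)
n  <- 100                        # number of production runs (assumed)
p  <- 40                         # sampling times per run (assumed)
tt <- seq(0, 20, length.out = p) # days at which the sensor is read

Tmat <- matrix(tt, n, p, byrow = TRUE)       # evaluation times, same for every run
Xmat <- matrix(rnorm(n * p), n, p)           # simulated stand-in for, e.g., temperature curves
z    <- factor(sample(c("A", "B"), n, TRUE)) # a discrete, cross-sectional predictor

# simulate quality as a weighted integral of the curve plus an effect of z
beta <- sin(pi * tt / 20)
y    <- drop(Xmat %*% beta) / p + (z == "B") + rnorm(n, sd = 0.2)

# s(Tmat, by = Xmat) estimates the coefficient function beta(t) in the term
# "integral of beta(t) * X_i(t) dt", using mgcv's summation convention
fit <- gam(y ~ s(Tmat, by = Xmat) + z, method = "REML")
summary(fit)

# pointwise intervals on the fitted quality via the usual se.fit machinery
pr <- predict(fit, se.fit = TRUE)
head(cbind(fit = pr$fit,
           lwr = pr$fit - 2 * pr$se.fit,
           upr = pr$fit + 2 * pr$se.fit))
```

Each curve-valued predictor gets its own s(Tmat_j, by = Xmat_j) term, and predictors sampled at different rates simply get matrices with different numbers of columns.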
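
Along the same lines, here is a minimal sketch of the MLP + LSTM combination from item 2, using the keras R package. This is not the architecture from the paper: the input shapes (40 time steps, 3 time-varying sensors, 5 static features) and all layer sizes are made-up assumptions, just to show how the two branches are wired together.

```r
library(keras)

seq_in  <- layer_input(shape = c(40, 3), name = "sensors") # time-varying predictors
stat_in <- layer_input(shape = c(5),     name = "static")  # cross-sectional predictors

seq_branch <- seq_in %>%
  layer_lstm(units = 32)                          # summarises the 20-day trajectory

stat_branch <- stat_in %>%
  layer_dense(units = 16, activation = "relu")    # MLP branch for static features

out <- layer_concatenate(list(seq_branch, stat_branch)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "softplus") # softplus keeps quality positive

model <- keras_model(inputs = list(seq_in, stat_in), outputs = out)
model %>% compile(optimizer = "adam", loss = "mse")

# train with model %>% fit(list(x_seq, x_static), y, ...) once the data
# have been arranged into the assumed array shapes
```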
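
Finally, for the interval question in the last bullet, a rough sketch of what the bag of little bootstraps (Kleiner et al., 2014) looks like in code. Here `statistic` is a hypothetical user-supplied function that refits the model on a data subset with resampling weights and returns the quantity whose interval you want (e.g., the predicted quality of a new run):

```r
# bag of little bootstraps: s subsets of size n^gamma, r multinomial
# resamples per subset, then average the per-subset interval endpoints
blb_interval <- function(data, statistic, s = 20, r = 100, gamma = 0.6) {
  n <- nrow(data)
  b <- floor(n^gamma)                        # size of each little subset
  ends <- replicate(s, {
    sub  <- data[sample(n, b), , drop = FALSE]
    reps <- replicate(r, {
      # weights summing to n emulate a full-size bootstrap resample
      w <- as.numeric(rmultinom(1, size = n, prob = rep(1 / b, b)))
      statistic(sub, w)
    })
    quantile(reps, c(0.025, 0.975))          # per-subset 95% interval
  })
  rowMeans(ends)                             # averaged lower/upper endpoints
}
```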

Best of luck!
