Bake perfect cookies: methods or techniques to determine optimal ranges of variables in classification or logistic regression models

JasonAizkalns · January 26, 2018, 1:30am

This is probably best explained via an analogy: let's assume you're baking cookies because cookies are awesome. Let's also assume we can only control three variables in our cooking process: time, temperature, and humidity. We know there's an interaction between time and temperature, but let's also assume there are additional interactions amongst the other variables.

Our outcome variable is "cookie goodness" -- for the sake of argument, let's pretend it's a binary outcome -- we have "good cookies" and "bad cookies".

There are plenty of methods and models for classifying good cookies vs. bad cookies, but let's assume we want the best "settings" for time, temperature, and humidity. Since these are settings, each one must be a single continuous range (for example: keep time between 15 and 25 minutes, temperature between 350 and 375, and humidity above 50), chosen so that no other combination of ranges yields more "good cookies".

Are there methods or techniques for approaching a classification model with these constraints? My first thought is a tree-based model, but a decision tree will likely return several interactions, and since I need one ideal "setting" for each variable, how do I know I'm not missing some subset of "settings" down a different branch of the tree? Is this just an exercise in optimizing the tree and selecting the best branch? Is there some way to force a decision tree to include only one set of continuous ranges for each variable, or to collapse/summarize the output of the tree into continuous rulesets for each variable? This almost feels like a reverse Design of Experiments (DOE).

Then there's the problem of the tree suggesting a range that is too "tight" (e.g. keep temperature between 400 and 401 -- my oven's not that accurate). I suppose we could check branches with high support and high confidence and then see if they have implementable ranges.
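To make the "check branches with high support and high confidence" idea concrete, here is a rough sketch using rpart. The simulated data, the seed, the `minbucket`/`cp` values, and the 10-degree minimum temperature width are all assumptions picked for illustration, and the ranges reported are the observed min/max of the rows landing in each leaf, not the tree's exact split points.

```r
library(rpart)

set.seed(42)  # assumption: seed chosen only so the sketch is reproducible
n <- 500
time     <- runif(n, 15, 40)
temp     <- runif(n, 300, 450)
humidity <- runif(n, 35, 60)
score <- (2*time + 3*time*temp + 4*humidity*temp + 5*humidity + rnorm(n)) / 10000
good  <- factor(as.integer(score >= quantile(score, 0.25) &
                           score <= quantile(score, 0.75)))
cookies <- data.frame(time, temp, humidity, good_cookie = good)

# minbucket keeps leaves from getting too small (a crude guard against "tight" rules)
fit <- rpart(good_cookie ~ time + temp + humidity, data = cookies,
             method = "class",
             control = rpart.control(minbucket = 25, cp = 0.01))

# fit$where maps each training row to the row of fit$frame for its leaf
cookies$leaf <- fit$where

# One summary row per leaf: share of all trials (support), share of good
# cookies inside the leaf (confidence), and the observed range of each variable
leaves <- do.call(rbind, lapply(split(cookies, cookies$leaf), function(d) {
  data.frame(
    leaf       = d$leaf[1],
    support    = nrow(d) / nrow(cookies),
    confidence = mean(d$good_cookie == "1"),
    time_min = min(d$time),     time_max = max(d$time),
    temp_min = min(d$temp),     temp_max = max(d$temp),
    hum_min  = min(d$humidity), hum_max  = max(d$humidity)
  )
}))

# Rank by confidence, then keep only leaves whose temperature range is wide
# enough to be implementable (hypothetical threshold: at least 10 degrees)
leaves <- leaves[order(-leaves$confidence), ]
implementable <- leaves[(leaves$temp_max - leaves$temp_min) >= 10, ]
print(implementable)
```

This only ranks the leaves the tree happened to find; it does not prove that no better single set of ranges exists elsewhere, which is the part I'm still unsure about.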

In case anyone wants to explain with some fake data, here you go:

library(tidyverse)

n <- 500
time <- runif(n, 15, 40)
temp <- runif(n, 300, 450)
humidity <- runif(n, 35, 60)

cookie_goodness <- (2*time + 3*time*temp + 4*humidity*temp  + 5*humidity + rnorm(n)) / 10000

good_cookie_lower_limit <- quantile(cookie_goodness, 0.25)
good_cookie_upper_limit <- quantile(cookie_goodness, 0.75)

df <- tibble(
  trial = 1:n,
  time = time,
  temp = temp,
  humidity = humidity,
  good_cookie = as.factor(
    if_else(between(cookie_goodness, good_cookie_lower_limit, good_cookie_upper_limit),
            1, 0)
    )
)

# # A tibble: 500 x 5
#    trial  time  temp humidity good_cookie
#    <int> <dbl> <dbl>    <dbl> <fctr>     
# 1      1  40.0   377     54.5 0          
# 2      2  39.0   384     53.4 0          
# 3      3  30.9   325     38.4 0          
# 4      4  25.5   447     35.1 1          
# 5      5  19.0   348     44.7 0          
# 6      6  37.1   431     42.7 0          
# 7      7  34.8   400     43.5 1          
# 8      8  26.2   346     36.9 0          
# 9      9  19.2   409     38.0 0          
# 10    10  38.0   408     46.9 0          
# # ... with 490 more rows
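
For the "one continuous range per variable" question, one naive baseline is a brute-force search over axis-aligned boxes. This is not a tree method and not an established answer, just a sketch under assumptions: the data is regenerated with the same recipe as above (with an arbitrary seed), candidate bounds are the quintile cut points of each variable (which also keeps any range from getting too tight), and `min_support` is a hypothetical threshold.

```r
set.seed(42)  # assumption: arbitrary seed for reproducibility
n <- 500
cookies <- data.frame(time     = runif(n, 15, 40),
                      temp     = runif(n, 300, 450),
                      humidity = runif(n, 35, 60))
score <- with(cookies, (2*time + 3*time*temp + 4*humidity*temp +
                        5*humidity + rnorm(n)) / 10000)
cookies$good <- score >= quantile(score, 0.25) & score <= quantile(score, 0.75)

# All (lower, upper) pairs built from the quintile cut points of x
bound_pairs <- function(x) {
  b <- unname(quantile(x, probs = seq(0, 1, by = 0.2)))
  g <- expand.grid(lo = b, hi = b)
  g[g$lo < g$hi, ]
}
pt <- bound_pairs(cookies$time)
pT <- bound_pairs(cookies$temp)
ph <- bound_pairs(cookies$humidity)

# Every combination of one range per variable (15 x 15 x 15 candidate boxes)
grid  <- expand.grid(i = seq_len(nrow(pt)), j = seq_len(nrow(pT)),
                     k = seq_len(nrow(ph)))
boxes <- data.frame(time_lo = pt$lo[grid$i], time_hi = pt$hi[grid$i],
                    temp_lo = pT$lo[grid$j], temp_hi = pT$hi[grid$j],
                    hum_lo  = ph$lo[grid$k], hum_hi  = ph$hi[grid$k])

# For each box: share of trials inside it, and share of good cookies among those
stats <- vapply(seq_len(nrow(boxes)), function(r) {
  b <- boxes[r, ]
  inside <- cookies$time     >= b$time_lo & cookies$time     <= b$time_hi &
            cookies$temp     >= b$temp_lo & cookies$temp     <= b$temp_hi &
            cookies$humidity >= b$hum_lo  & cookies$humidity <= b$hum_hi
  c(mean(inside), if (any(inside)) mean(cookies$good[inside]) else NA_real_)
}, numeric(2))
boxes$support    <- stats[1, ]
boxes$confidence <- stats[2, ]

# Best box among those covering enough trials (hypothetical threshold)
min_support <- 0.10
candidates  <- boxes[boxes$support >= min_support, ]
best <- candidates[which.max(candidates$confidence), ]
print(best)
```

The result is evaluated on the same data used for the search, so the confidence will be optimistic; a holdout set or cross-validation would be needed before trusting the ranges.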