Working on an approach to logging models through time

robertmitchellv · August 8, 2018, 5:17pm

Hi all!

I know that many working data scientists already have something developed for logging model objects, perhaps data, summary output, and other important features, as they continue to improve a model or try new approaches to whatever question they're interested in.

For folks like me, who might be the only person in their organization doing this work, I've been thinking about how to keep track of model performance over time as the data I feed it changes, as parameters change, as summary outputs change, etc. I wanted to try to create a data structure for logging not just a model, but all of the other important associated pieces of data that I might need should I ever have to go back and understand how or why something changed.

I have a rough draft of how I'm thinking about it that I would like to further develop into a small package with helper functions. The approach wouldn't be useful for everyone--especially for those working with huge volumes of data, but for those of us with relatively small data, this could be a great way to keep track of a project over the course of a year.

Just looking for some more feedback--I shared this on rOpenSci's slack and was worried it was garbage, but some folks said it's a nice approach, so I thought I'd get more feedback and tag @Max to see if this could be useful with the new suite of tidymodels tooling (NEED STICKER).

I was talking with @alexpghayes about it and I like his idea of a grammar of modeling in R, which would make really great metadata for a log--especially keeping track of the conceptual attributes of a model and its family with the implementation of that concept with parameters and data captured in one event/record/row.

library(tidyverse)
library(broom)

auto <- ISLR::Auto %>% janitor::clean_names()

# First log entry: simple linear regression of mpg on horsepower
model_log <- auto %>% 
  mutate(
    model_type = "Supervised",
    model_subtype = "Linear Regression",
    data_name = "auto", 
    data_source = "From ISLR",
    date_run = Sys.time()) %>%
  group_by(model_type, model_subtype, data_name, data_source, date_run) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    model = map(data, ~ lm(
      mpg ~ horsepower, 
      data = .))) %>%
  mutate(
    tidy = map(model, ~ tidy(.)),
    glance = map(model, ~ glance(.)),
    augment = map(model, ~ augment(.)), 
    notes = "",
    session_info = list(sessionInfo()))

# Second log entry: add a squared horsepower term, then append to the existing log
auto %>%
  mutate(
    model_type = "Supervised",
    model_subtype = "Linear Regression",
    data_name = "auto", 
    data_source = "From ISLR",
    date_run = Sys.time()) %>%
  group_by(model_type, model_subtype, data_name, data_source, date_run) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    data = map(
      data,
      ~ mutate(
        .x,
        horsepower2 = horsepower^2)),
    model = map(data, ~ lm(
      mpg ~ horsepower + horsepower2, 
      data = .))) %>%
  mutate(
    tidy = map(model, ~ tidy(.)),
    glance = map(model, ~ glance(.)),
    augment = map(model, ~ augment(.)),
    notes = "",
    session_info = list(sessionInfo())) %>%
  bind_rows(model_log) %>% 
  arrange(date_run) -> model_log 

model_log
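The two blocks above repeat the same pattern, so the helper functions I have in mind might look something like the sketch below. `log_model()` is a hypothetical name, not a function from any package, and the example uses the built-in `mtcars` data instead of `auto` so that it runs on its own. It assumes `tibble`, `dplyr`, and `broom` are installed.

```r
library(tibble)
library(dplyr)
library(broom)

# Hypothetical helper: fit one lm() and return a one-row log entry
# with the same columns as the model_log built above.
log_model <- function(data, formula, model_type, model_subtype,
                      data_name, data_source, notes = "") {
  fit <- lm(formula, data = data)
  tibble(
    model_type    = model_type,
    model_subtype = model_subtype,
    data_name     = data_name,
    data_source   = data_source,
    date_run      = Sys.time(),
    data          = list(data),          # list-columns hold one object per row
    model         = list(fit),
    tidy          = list(tidy(fit)),
    glance        = list(glance(fit)),
    augment       = list(augment(fit)),
    notes         = notes,
    session_info  = list(sessionInfo())
  )
}

# Two entries for the same data set, appended and ordered by run time
model_log <- bind_rows(
  log_model(mtcars, mpg ~ hp, "Supervised", "Linear Regression",
            "mtcars", "Built-in R data set"),
  log_model(mtcars, mpg ~ hp + wt, "Supervised", "Linear Regression",
            "mtcars", "Built-in R data set", notes = "added wt")
) %>%
  arrange(date_run)
```

Each call produces one row, so appending a new run is just another `bind_rows()`.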

Any thoughts?

An example of the kind of detail I'm aware of/thinking about but not addressing yet is: whether or not a model object is also storing input data--don't want to duplicate things, but that's a finer detail I'd try to work out later.
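For `lm()` specifically, here is a minimal sketch of how one might check this, again using `mtcars` so it runs on its own. The `model = FALSE` argument is part of `stats::lm()`; the object names are just for illustration. Note that even the slimmer fit still keeps a QR decomposition with one row per observation, so this reduces duplication rather than removing it.

```r
# By default lm() keeps a copy of the model frame (the data actually used)
fit_full <- lm(mpg ~ hp, data = mtcars)

# model = FALSE asks lm() not to store that copy
fit_slim <- lm(mpg ~ hp, data = mtcars, model = FALSE)

# The default fit carries the model frame; the slim fit does not
has_frame_full <- !is.null(fit_full$model)
has_frame_slim <- !is.null(fit_slim$model)

# Compare in-memory sizes of the two fitted objects
size_full <- as.numeric(object.size(fit_full))
size_slim <- as.numeric(object.size(fit_slim))

# The estimated coefficients are unaffected by the storage choice
same_coefs <- isTRUE(all.equal(coef(fit_full), coef(fit_slim)))
```

If the log already stores the data in its own column, dropping the model frame from the fitted object is one way to avoid keeping it twice, at the cost of some downstream functions needing the data passed in explicitly.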

Max · August 10, 2018, 5:07pm

Good ideas!

There's mlflow too.

I've been working on stuff like this for about a decade, mostly for updating models on updated data sets and tracking changes and performance.

I think that we need a good solution, but the first step is to write (and get feedback on) a specification of what you want to do, why, and so on before writing code.

For example, mlflow is a nice system but we are working on putting some sort of convention in place. If you want to track performance, let's come up with some conventions to annotate how that performance was calculated. Was it a test set? The same test set as last time? Cross-validation? etc. Right now, mlflow and other solutions are like XML: a specification for data, but they are too generic.
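To make the idea concrete, here is one purely illustrative sketch of what such an annotation might look like in R. Nothing here is an existing convention or part of mlflow: `performance_record()`, its arguments, and the `data_id` labels are all hypothetical, and the train/test split is arbitrary.

```r
# Hypothetical helper: one row describing a performance estimate AND how it
# was obtained, so later runs can be compared like for like.
performance_record <- function(metric, estimate,
                               method = c("test_set", "cross_validation",
                                          "resubstitution"),
                               data_id, n_folds = NA_integer_) {
  method <- match.arg(method)
  if (method == "cross_validation" && is.na(n_folds)) {
    stop("n_folds is required when method is 'cross_validation'")
  }
  data.frame(
    metric      = metric,
    estimate    = estimate,
    method      = method,
    data_id     = data_id,   # label for the data the estimate was computed on
    n_folds     = n_folds,
    recorded_at = Sys.time(),
    stringsAsFactors = FALSE
  )
}

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Arbitrary split of the built-in mtcars data, for illustration only
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]
fit   <- lm(mpg ~ hp, data = train)

perf_log <- rbind(
  performance_record("rmse", rmse(train$mpg, fitted(fit)),
                     method = "resubstitution", data_id = "mtcars_rows_1_24"),
  performance_record("rmse", rmse(test$mpg, predict(fit, newdata = test)),
                     method = "test_set", data_id = "mtcars_rows_25_32")
)
```

With `method` and `data_id` recorded next to each estimate, "the same test set as last time?" becomes a matter of comparing `data_id` values across runs.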

robertmitchellv · August 10, 2018, 7:40pm

Thanks for responding Max!

And for pointing me to mlflow--by the link you sent, I take it there's a private fork RStudio is working on? This effort seems like a perfect fit within the ecosystem--especially given sparklyr.

I also 100 percent agree about the urgency of a specification for the model and the need for this conceptual work to precede the actual implementation in code using parameters and data.

An outcome-based strategy, like you suggest, could be really useful since it would condition the kinds of metadata you would need in order to achieve your goal. I imagine from a knowledge organization perspective, creating the classification scheme for this would be really challenging since the domain-specific jargon isn't unified within a kind of statistical modeling/machine learning ontology (which is a sort of dream of mine to work on, haha).

Naively, starting with metadata for specific kinds of goals sounds like a really good place to start though (I may not have enough experience to know otherwise). If an end-user wants to perform cross-validation they can be given a specification that will enable them to achieve this kind of goal. In fact, I can see usethis being a really powerful tool for achieving this when starting a project, e.g., specifying desired (perhaps multiple) outcomes and having that scaffolding built into an Rproj would abstract the conceptual thinking/work away from the end-user in a way that would make this model annotation work appear less austere/daunting.

Of course, I'm getting way ahead of myself here. I was just imagining what a final version would look like. That's sort of what that code block I posted was--imagining how I could bridge some of the conceptual work with the actual implementation in one file. Something like that in conjunction with a specification and set metadata would be incredibly powerful, I think. I imagine that with your experience in this field there is a lot to draw on in thinking about core sets of metadata that can become a good foundation for doing this kind of continual documentation of models, data, parameters, and how they all change through time.

I'm not a veteran statistical modeler at all, so I'm not sure what kinds of contributions I could really make, but I'd love to be involved in any of this work.

Max · August 10, 2018, 8:10pm

Yeah, I didn't notice that it's private (for now). Hopefully, I didn't mess anything up by noting that.

Don't let that stop you at all!

I'd first take some application or example that you do know a lot about and start thinking about how it would generalize. What characteristics of the modeling process would need annotation? Preprocessing methods? How would we concisely articulate/label them in a model history "database"? Start doing a lot of "what if?" thinking and document it in GitHub. Then ask for people to shoot holes in it and keep editing.

robertmitchellv · August 10, 2018, 8:25pm

Thanks for the encouragement! I think I'll try to build the thing that I need for now and see how extensible I can make it while I'm doing that "what if?" thinking.