Hi all!
I know that many working data scientists already have tooling developed for them to log model objects, perhaps data, summary output, and other important features as they keep improving a model or try new approaches to whatever question they're interested in.
For folks like me, who might be the only person in their organization doing this work, I've been thinking about how to keep track of model performance over time as the data I feed it changes, as parameters change, as summary outputs change, etc. I wanted to try to create a data structure for logging not just a model, but all of the other associated pieces of data I might need should I ever have to go back and understand how or why something changed.
I have a rough draft of how I'm thinking about it that I would like to further develop into a small package with helper functions. The approach wouldn't be useful for everyone--especially those working with huge volumes of data--but for those of us with relatively small data, this could be a great way to keep track of a project over the course of a year.
Just looking for some more feedback--I shared this on rOpenSci's Slack and was worried it was garbage, but some folks said it's a nice approach, so I thought I'd get more feedback and tag @Max to see if this could be useful with the new suite of tidymodels tooling (NEED STICKER).
I was talking with @alexpghayes about it and I like his idea of a grammar of modeling in R, which would make really great metadata for a log--especially for keeping track of the conceptual attributes of a model and its family alongside the implementation of that concept, with parameters and data captured in one event/record/row.
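library(tidyverse)
library(broom)

# Load the example data and standardize the column names
auto <- ISLR::Auto %>% janitor::clean_names()

# First log entry: one row holding the metadata, the nested data,
# the fitted model, and its broom summaries
model_log <- auto %>%
  mutate(
    model_type = "Supervised",
    model_subtype = "Linear Regression",
    data_name = "auto",
    data_source = "From ISLR",
    date_run = Sys.time()) %>%
  # Nest everything except the metadata into a single `data` list-column
  group_by(model_type, model_subtype, data_name, data_source, date_run) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    model = map(data, ~ lm(
      mpg ~ horsepower,
      data = .))) %>%
  mutate(
    tidy = map(model, ~ tidy(.)),        # coefficient-level summary
    glance = map(model, ~ glance(.)),    # model-level fit statistics
    augment = map(model, ~ augment(.)),  # observation-level fitted values and residuals
    notes = "",
    session_info = list(sessionInfo()))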

# Second log entry: same data with a quadratic horsepower term,
# appended to the existing log
model_log <- auto %>%
  mutate(
    model_type = "Supervised",
    model_subtype = "Linear Regression",
    data_name = "auto",
    data_source = "From ISLR",
    date_run = Sys.time()) %>%
  group_by(model_type, model_subtype, data_name, data_source, date_run) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    # Add the squared term to the nested data so the log records
    # exactly what the model was fit on
    data = map(
      data,
      ~ mutate(
        .x,
        horsepower2 = horsepower^2)),
    model = map(data, ~ lm(
      mpg ~ horsepower + horsepower2,
      data = .))) %>%
  mutate(
    tidy = map(model, ~ tidy(.)),
    glance = map(model, ~ glance(.)),
    augment = map(model, ~ augment(.)),
    notes = "",
    session_info = list(sessionInfo())) %>%
  bind_rows(model_log) %>%
  arrange(date_run)

model_log
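Since the two entries above repeat the same steps, here is a minimal sketch of what one of the helper functions might look like. The name `log_model()` and its arguments are hypothetical (nothing like this exists in a package yet), it only handles `lm()`, and it uses the built-in `mtcars` data as a stand-in so it runs without ISLR or janitor installed.

```r
library(dplyr)
library(tibble)
library(tidyr)
library(broom)

# Hypothetical helper: fit one lm() and return a one-row log entry
# with the same columns as the hand-built log above.
log_model <- function(data, formula, model_type, model_subtype,
                      data_name, data_source, notes = "") {
  fit <- lm(formula, data = data)
  tibble(
    model_type    = model_type,
    model_subtype = model_subtype,
    data_name     = data_name,
    data_source   = data_source,
    date_run      = Sys.time(),
    data          = list(data),
    model         = list(fit),
    tidy          = list(broom::tidy(fit)),
    glance        = list(broom::glance(fit)),
    augment       = list(broom::augment(fit)),
    notes         = notes,
    session_info  = list(sessionInfo())
  )
}

# Build a two-entry log (mtcars is an assumption, standing in for `auto`)
model_log <- bind_rows(
  log_model(mtcars, mpg ~ hp, "Supervised", "Linear Regression",
            "mtcars", "Built-in R dataset", notes = "linear"),
  log_model(mtcars, mpg ~ hp + I(hp^2), "Supervised", "Linear Regression",
            "mtcars", "Built-in R dataset", notes = "quadratic")
) %>%
  arrange(date_run)

# Compare fit statistics across runs by unnesting the `glance` column
comparison <- model_log %>%
  select(date_run, notes, glance) %>%
  unnest(cols = glance) %>%
  select(date_run, notes, r.squared, adj.r.squared, AIC)

comparison
```

Unnesting `glance` like this is one way to see how performance moves from run to run; the same pattern should work on `tidy` to track how coefficients change.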
Any thoughts?
One detail I'm aware of but not addressing yet is whether a model object also stores its input data. I don't want to duplicate things, but that's a finer point I'd try to work out later.
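On the duplication question, a quick base R check shows what `lm()` does by default. This is only a sketch using the built-in `mtcars` data as a stand-in; the variable names are made up for illustration. By default `lm()` keeps the model frame (the columns used in the formula) inside the fitted object, and its documented `model = FALSE` argument drops it.

```r
# Default: the fitted object carries a copy of the model frame
fit_full <- lm(mpg ~ hp, data = mtcars)

# model = FALSE: the model frame is not stored in the fitted object
fit_slim <- lm(mpg ~ hp, data = mtcars, model = FALSE)

has_frame_full <- !is.null(fit_full[["model"]])
has_frame_slim <- !is.null(fit_slim[["model"]])

# The estimates are the same either way; only the stored extras differ
same_coefs <- isTRUE(all.equal(coef(fit_full), coef(fit_slim)))

# Rough size comparison of the two objects
size_full <- object.size(fit_full)
size_slim <- object.size(fit_slim)

c(full = has_frame_full, slim = has_frame_slim)
print(size_full, units = "Kb")
print(size_slim, units = "Kb")
```

Note that the slim object still holds residuals, fitted values, and the QR decomposition, so it is smaller but not free of per-observation data. Anything that needs the model frame later would have to rebuild it from the original data, which the log already keeps in its `data` column.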