Augment() inconsistent behaviour

kasramhdz · September 10, 2023, 7:01am

When I pass a traditional model object to augment(), it returns:

the predictions in .fitted column
depending on whether the new_data or data argument was provided, we'd also get .resid (and .std.resid, .hat, .sigma, .cooksd).
it also computes intervals when the interval argument is provided.

fit_trad <- lm(mpg ~ wt, data = mtcars)
augment(fit_trad, data = mtcars)
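
For comparison, here is a minimal sketch of the two call styles for the lm method, assuming broom is installed. It only inspects column names. Note that broom's lm method spells the argument `newdata`, without an underscore.

```r
library(broom)

fit_trad <- lm(mpg ~ wt, data = mtcars)

# Training data via `data`: fitted values plus per-observation diagnostics
aug_data <- augment(fit_trad, data = mtcars)
names(aug_data)

# Other rows via `newdata`, with confidence intervals requested
aug_new <- augment(fit_trad, newdata = mtcars, interval = "confidence")
names(aug_new)
```

The diagnostic columns such as .std.resid appear only in the first call; the second call adds .lower and .upper next to .fitted.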

However, if you provide it with a model workflow object:

This time you'll only get the prediction columns and with a different name: .pred
Augment would not accept data argument and accepts only the new_data (or newdata?! according to the help page) argument.
Providing the interval argument doesn't seem to do anything.

rec <- 
  recipe(mpg ~ wt, data = mtcars)

spec_lm <- 
  linear_reg() %>% 
  set_engine("lm")

wf <- 
  workflow() %>% 
  add_recipe(rec) %>% 
  add_model(spec_lm) %>% 
  fit(data = mtcars)

augment(wf, new_data = mtcars, interval = "confidence")

Now if you feed it a Parsnip object:

Results are almost the same as with the workflow object,
except this time, you'd get the .resid column too but not the .std.resid
Unlike with the wf, here you have to apply the recipe to new_data separately. (I just realized this in my original code.)

parsnip <- wf %>% extract_fit_parsnip()
augment(parsnip, new_data = mtcars, interval = "confidence")

So say I still want to use the tidymodels approach but get the results I would have got from a traditional model fit: I'd extract the underlying model fit and then plug it into augment().
This yields the confidence interval, but:

Other columns of the original data are removed!
Sometimes (despite plugging in the training data) this approach won't give you the .std.resid
The outcome column name is changed to ..y

fit_new <- wf %>% extract_fit_engine()
augment(fit_new, new_data = mtcars, interval = "confidence")

This is so confusing. I expected to just plug in the workflow object and new_data and get exactly what I would have got had I plugged in the traditional model fit.

P.S.: The newdata / new_data argument is also confusing. The help page says the augment() argument is newdata, but actually:

If you are using tidymodels objects, you should use new_data.
If you are using a traditional model (example 1), it's newdata. In this case, if you don't provide it (or mistakenly plug in new_data), augment() silently uses the data used for the model fit.
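
The silent fallback can be checked by counting rows. This is a small sketch, assuming broom's lm method behaves as described above; `new_rows` is a hypothetical name used only for illustration, and `suppressWarnings()` is there in case an installed broom version chooses to warn about the unused argument.

```r
library(broom)

fit_trad <- lm(mpg ~ wt, data = mtcars)
new_rows <- head(mtcars)  # hypothetical stand-in for unseen data (6 rows)

# Correct spelling for the lm method: one output row per row of `newdata`
aug_ok <- augment(fit_trad, newdata = new_rows)

# Misspelled as `new_data`: the argument falls into `...` and is not used,
# so augment() works from the data stored with the model fit instead
aug_typo <- suppressWarnings(augment(fit_trad, new_data = new_rows))

# Compare how many rows each call returned
c(newdata = nrow(aug_ok), new_data = nrow(aug_typo))
```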

simoncouch · September 25, 2023, 2:23pm

Thanks for the post!

re: 2.

This time you'll only get the prediction columns and with a different name: .pred

tidymodels can only promise internal consistency here: any model object situated in a workflow will always house its predictions in a predictable way.

Augment would not accept data argument and accepts only the new_data (or newdata?! according to the help page) argument.

re: not accepting data, tidymodels is a predictive-modeling focused framework and thus does not make resubstitution (predicting on training data) easy.

re: the change to new_data, this argument name is snake case and thus internally consistent with the way we name arguments in the rest of the package ecosystem. We prioritize consistency among the packages we write over consistency elsewhere, as we can't control what other package authors do.

Providing the interval argument doesn't seem to do anything.

Yup, this ought to be better documented, as in augment.model_fit(). Thanks!
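
In the meantime, one way to get intervals out of a fitted workflow is to ask predict() for them and bind the result back onto the data. This is a sketch and not an official recommendation from this thread; it assumes the lm engine, which supports `type = "conf_int"`, and the name `res` is only illustrative.

```r
library(tidymodels)

wf <- workflow() %>%
  add_recipe(recipe(mpg ~ wt, data = mtcars)) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(data = mtcars)

# Confidence intervals come from predict(), not from augment()
ci <- predict(wf, new_data = mtcars, type = "conf_int")

# Bind point predictions and intervals back onto the original columns
res <- bind_cols(mtcars, predict(wf, new_data = mtcars), ci)
names(res)
```

The interval columns follow the tidymodels naming scheme (.pred_lower, .pred_upper), not broom's .lower and .upper.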

re: 3.

except this time, you'd get the .resid column too but not the .std.resid

With both workflows and parsnip fits (dev versions), I see the same result. When you say "but not the .std.resid", I believe you're referring to the output of broom's augment() method for lm, which we don't promise consistency with.

Unlike with the wf, here you have to apply the recipe to new_data separately. (I just realized this in my original code.)

This is exactly the benefit of using a workflow.

I expected to just plug in the workflow object and new_data and get exactly what I would have got had I plugged in the traditional model fit.

The parsnip and workflow tidying methods are designed to be as internally consistent as possible (workflow in, same type of output out). It would be nice to get that additional information you're describing out of, say, the method for lm(), but if I pass output from different modeling approaches into their own tidying methods, I will get different output columns depending on what's well-defined for that model.

re: 4.

I want to still use tidymodels approach but get the results produced had I provided traditional model fit... This is so confusing. I expected to just plug in the workflow object and new_data and get exactly what I would have got had I plugged in the traditional model fit.

A trade-off in this case, I suppose. Exchanging consistency for richness.

P.S: the newdata / new_data argument is also confusing. the help document says augment() argument is newdata.

The help documents for workflows and parsnip methods refer to the arguments they use, and the help documents for those exported from broom refer to the arguments they use. The help document for the generic does not refer to either. Please refer to the documentation for the method you're using and let us know how you feel the documentation for a specific method could be improved!

For those interested, there’s some additional context here and in linked threads: add residuals when outcome is available in `augment.workflow()` by simonpcouch · Pull Request #201 · tidymodels/workflows · GitHub
