When ranking models with the tune package, the metrics are averaged over the resamples as if every resample had the same size (and as if the metric were linear).
That is fine for equal-size slices and linear metrics like MSE or MAE, but it's dubious for unequal sizes and/or non-linear metrics like AUC.
Say I do time-based walk-forward splits by month and the last slice has much less data in its validation set: the averaging will favor lucky models that happened to perform well on that last small slice, instead of, for example, weighting each slice by its size.
I think the averages computed by collect_metrics should at least be weighted by the size of the validation set (or, better, recomputed from the predictions over the union of all validation sets).
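To make the point concrete, here is a small language-agnostic sketch in Python (all slice sizes, labels, and scores are made up for illustration; this is not tune's implementation). It shows that for AUC the unweighted mean over slices, the size-weighted mean, and the AUC recomputed on the pooled predictions can all disagree:

```python
# Hypothetical per-slice results for one model: monthly walk-forward
# validation sets, where the last slice is much smaller than the others.

def auc(y_true, y_score):
    """Plain Mann-Whitney AUC: fraction of (positive, negative) pairs
    where the positive example gets the higher score (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two slices: a big one where the model does poorly and a tiny,
# "lucky" one where it looks perfect.
big_y, big_p = [0, 0, 1, 0, 1, 0], [0.6, 0.7, 0.4, 0.8, 0.5, 0.3]
tiny_y, tiny_p = [0, 1], [0.2, 0.9]

auc_big = auc(big_y, big_p)    # poor on the big slice
auc_tiny = auc(tiny_y, tiny_p)  # perfect on the tiny slice

# Unweighted mean over slices (averaging per-slice metrics).
unweighted = (auc_big + auc_tiny) / 2

# Mean weighted by validation-set size.
n_big, n_tiny = len(big_y), len(tiny_y)
weighted = (auc_big * n_big + auc_tiny * n_tiny) / (n_big + n_tiny)

# AUC recomputed on the union of all validation predictions: because
# AUC is not linear, this matches neither average.
pooled = auc(big_y + tiny_y, big_p + tiny_p)

print(unweighted, weighted, pooled)
```

The unweighted mean is pulled up by the tiny slice, the size-weighted mean corrects for that, and the pooled AUC differs from both, which is why recomputing over the union of validation sets is the more faithful option for non-linear metrics.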
Yes good idea!
I did not open an issue because I did not know whether this was already handled somewhere I had missed.
The problem with your solution is that it's more work for me!
I tried to use workflowsets::rank_results, but it calls collect_metrics internally, so I cannot modify the call there.