I am confused about the advantages of using recipe steps for data transformations as opposed to modifying the data itself.
For example, if I have a process like:
Get data
Simple cleaning
Split
Explore training data
And this process leads me to believe that I want to log-transform my dependent variable, what is the advantage of adding step_log(y) to a recipe, compared with adding mutate(y = log(y)) to my simple cleaning step above and then rerunning the split?
It is easier to make sure things are going as intended if you modify the actual data, I think. I do see that there are some very handy recipe steps, so that is one advantage; are there others? A disadvantage is that it is harder to evaluate choices (e.g., picking parameters for step_other).
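For concreteness, here is a minimal sketch of the two alternatives being compared, assuming a hypothetical data frame `dat` with a positive outcome `y` and using the rsample/recipes functions from tidymodels:

```r
library(tidymodels)

# Option A: transform the raw data first, then split
dat_logged <- dat %>% mutate(y = log(y))
split_a    <- initial_split(dat_logged)

# Option B: leave the data alone and express the transformation as a recipe step
split_b <- initial_split(dat)
rec_b   <- recipe(y ~ ., data = training(split_b)) %>%
  step_log(y)
```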
If you are doing any deterministic, non-estimation work, then it makes sense to do those parts up front, as you suggest.
Otherwise, it's best to put the step into a recipe so that your performance statistics are appropriate. Using a recipe also has the side benefit that you don't have to code anything special when new data arrive.
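As a rough illustration of both points: a recipe bundled into a workflow is re-estimated inside every resample, and the fitted workflow applies the same preprocessing to new data at predict time. This sketch assumes a hypothetical data frame `dat` with a numeric outcome `y`; the specific steps and model are just placeholders.

```r
library(tidymodels)

split <- initial_split(dat)
folds <- vfold_cv(training(split), v = 5)

rec <- recipe(y ~ ., data = training(split)) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg())

# The medians, means, and sds are re-estimated inside each fold, so the
# resampled metrics do not leak information from the assessment sets
fit_resamples(wf, resamples = folds) %>% collect_metrics()

# Predicting on new data needs no extra code; the recipe is applied automatically
final_fit <- fit(wf, data = training(split))
predict(final_fit, new_data = testing(split))
```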
Some examples that are good for up-front work (a quick sketch follows this list):
- computing features from dates (e.g., month, day of the week, etc.)
- log transformations
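For instance, a sketch of that up-front, deterministic work using dplyr and lubridate, assuming a hypothetical `dat` with a `date` column and a positive outcome `y`:

```r
library(dplyr)
library(lubridate)

# These computations are the same no matter which rows end up in the
# training set, so they can be done once, before splitting
dat <- dat %>%
  mutate(
    month = month(date, label = TRUE),
    dow   = wday(date, label = TRUE),
    y     = log(y)
  )
```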
Things that should really go into a recipe (sketched after this list):
- PCA
- centering, scaling, and Box-Cox transformations (all use statistical estimates)
- feature selection
- imputation
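Here is one possible recipe holding these kinds of estimated steps; the particular choices (e.g., step_nzv standing in for a simple filter-style feature selection, `train` as a hypothetical training set with outcome `y`) are illustrative, not prescriptive:

```r
library(tidymodels)

rec <- recipe(y ~ ., data = train) %>%
  step_impute_knn(all_predictors()) %>%             # imputation: learned from training rows
  step_nzv(all_predictors()) %>%                    # simple filter-style feature selection
  step_BoxCox(all_numeric_predictors()) %>%         # estimates a lambda per column
  step_normalize(all_numeric_predictors()) %>%      # estimates means and sds
  step_pca(all_numeric_predictors(), num_comp = 5)  # estimates the loadings

# All of the estimates above come only from the data used to prep/fit the
# recipe, never from the assessment or test sets
```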
Some of this is a bit philosophical. Take a look at this video where we discuss this and the reasoning behind our recommendations.