Sometimes I want to run a model (e.g. lm(), glm(), etc.) separately for each level of a grouping variable rather than fit the variable as a covariate. As a relative newcomer to R, I've tried to learn {purrr}, but I struggle with the syntax and wonder if there's a simpler way. For context, I came to R after 30 years of using SAS, where the syntax for this is intuitive: you just add a "BY statement" to existing code and the models for each level of the BY group are seamlessly produced, including output (e.g. see SAS Help Center).
It would be nice to be able to adapt my stats model easily when working on subsets/groupings, in something like the way "facet_" works for ggplot.
For example, using the penguins data frame from {palmerpenguins}, the syntax would be something like:
Without subgrouping:
penguins |> lm(formula = bill_length_mm ~ flipper_length_mm) |> summary()
With subgrouping:
In contrast, when I've tried to learn purrr, I seem to have to use lists and .x syntax, which (IMHO) should be hidden from the user in the same way facet_ hides them in ggplot2.
I think nest_by() gets close, but having reviewed an example I think it doesn't quite hide all of the complexity (e.g. see Using in-line grouping to fit many models – Mike Mahoney).
Just so I'm clear, I'm not criticising the power of {purrr} itself; rather, I'm interested to know if there's a way to mimic the simplicity of by-group processing for stats modelling with tidyverse syntax, in the way facet_ does for ggplot2.
Thanks for the suggestion and, yes, it works, but according to Mike Mahoney's post (mentioned in my original) "...future function calls no longer have access to your raw data frame", and when I look at the documentation for the .by argument that appears to be by design. That is why Mike discusses using map() to access the elements of the model... so we're back to needing to learn purrr.
Perhaps a better example would have been the following, which shows how moving from the overall/pooled analysis becomes a bit cumbersome when by-groups are wanted:
# packages needed for the code below
library(dplyr)
library(broom)
library(palmerpenguins)

# overall analysis has elegant/concise syntax with pipes and use of broom::tidy
models_pooled_tidy <- penguins |>
  lm(formula = bill_length_mm ~ flipper_length_mm) |>
  tidy()

# in following code for by-groups, I've needed to add a mutate statement and use lists and specify the data=data syntax
models_by_island_tidy <- penguins |>
  nest_by(island) |>
  mutate(
    mod = list(lm(formula = bill_length_mm ~ flipper_length_mm,
                  data = data)),
    res = list(tidy(mod))
  )
It's certainly do-able, but I'm just curious whether a simpler syntax is available (like the ease of adding facet_ to ggplot code).
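One way to keep the call site short might be to tuck the grouping machinery into a small wrapper. The following is only a sketch under stated assumptions: tidy_by() is a hypothetical helper name (it is not part of dplyr, broom or any other package), and it relies on dplyr::group_modify(), which is still marked experimental.

```r
library(dplyr)
library(broom)
library(palmerpenguins)

# tidy_by() is a hypothetical helper written for this example only.
# It fits `fit` (lm by default) to each level of `by` and stacks the
# broom::tidy() output, keeping the grouping column alongside the results.
tidy_by <- function(data, by, model_formula, fit = lm) {
  data |>
    group_by({{ by }}) |>
    group_modify(function(df, key) tidy(fit(model_formula, data = df))) |>
    ungroup()
}

# The by-group call then reads much like the pooled one
models_by_island <- palmerpenguins::penguins |>
  tidy_by(island, bill_length_mm ~ flipper_length_mm)

models_by_island
```

The lists and the data = argument are still there, but only inside the helper, so moving from a pooled to a by-group analysis means changing a single line.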
Many thanks for continuing to help me. Adding rowwise() is a perfect solution when using glance() because it nicely preserves the island column within the "rectangular" structure of the glance output, so thank you.
However, when I replaced glance() with tidy() I got an error because res needed to be size 1 (tidy() returns one row per model term rather than a single row). I fixed it by wrapping tidy() in list(), which worked, but (obviously) res then becomes nested below island, so a bit more processing is required to get the "tidied" results into the same data frame as island.
Here's the code I used:
models_dotby_island_tidy <- penguins |>
  summarise(
    .by = island,
    model = list(lm(formula = bill_length_mm ~ flipper_length_mm))
  ) |>
  rowwise() |>
  # mutate(res = glance(model)) # for glance don't need a list
  mutate(res = list(tidy(model))) # for tidy need a list

models_dotby_island_tidy
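The "bit more processing" mentioned above can probably be as small as one tidyr::unnest() call. This is a sketch assuming {tidyr} is installed alongside {dplyr}, {broom} and {palmerpenguins}; the object name models_unnested is made up for this example.

```r
library(dplyr)
library(tidyr)
library(broom)
library(palmerpenguins)

models_unnested <- palmerpenguins::penguins |>
  summarise(
    .by = island,
    model = list(lm(formula = bill_length_mm ~ flipper_length_mm))
  ) |>
  rowwise() |>
  mutate(res = list(tidy(model))) |>
  ungroup() |>             # drop the rowwise grouping before unnesting
  select(island, res) |>   # the fitted model objects are no longer needed
  unnest(res)              # one row per island and model term

models_unnested
```

After unnest(), the tidy() columns (term, estimate, std.error and so on) sit in the same flat data frame as island, with one row per island and term.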
In summary, thanks for suggesting the mix of .by and rowwise(), which is certainly an improvement for me (and helps me minimise the use of purrr).
If anyone else has ideas about whether and how to get something as succinct as facet_-style syntax when modelling data, it would be great to hear them.