Looking for intuitive method for doing grouped analyses without purrr

kinnersleyn · April 16, 2025, 11:13am

Sometimes I want to run a model (e.g. lm(), glm(), etc.) separately for each level of a grouping variable rather than fit the variable as a covariate. As a relative newcomer to R, I've tried to learn {purrr}, but I struggle with the syntax and wonder if there's a simpler way. As context, I came to R after 30 years of using SAS, where the syntax was intuitive: you just add a "BY statement" to existing code and the models for each level of the BY group are seamlessly produced, including output (e.g. see SAS Help Center).
It would be nice to be able to adapt my stats model easily when working on subsets/groupings, in something like the way "facet_" does for ggplot.
For example, using the penguins data frame from {palmerpenguins}, the syntax would be something like:
Without subgrouping:

penguins |> lm(formula = bill_length_mm ~ flipper_length_mm) |> summary()

With subgrouping (the syntax I'd like to have):

penguins |> group_by(island) |> lm(formula = bill_length_mm ~ flipper_length_mm) |> summary()

In contrast, when I've tried to learn purrr, I seem to have to use lists and .x syntax, which (IMHO) should be hidden from the user in the same way facet_ hides them in ggplot2.
I think nest_by() gets close, but having reviewed an example, I think it doesn't quite hide all of the complexity (e.g. see Using in-line grouping to fit many models – Mike Mahoney).
Just so I'm clear, I'm not criticising the power of {purrr} itself. Rather, I'm interested to know if there's a way to mimic the simplicity of by-group processing for stats modelling with tidyverse syntax, in the way facet_ does for ggplot2.
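
For illustration, the by-group idea described above can be sketched in base R alone. The helper lm_by() below is hypothetical (it is not part of any package), and the example assumes {palmerpenguins} is installed:

```r
library(palmerpenguins)  # assumed installed; provides the `penguins` data

# Hypothetical helper (not from any package): fit one lm() per level of `by`.
lm_by <- function(data, by, formula, ...) {
  # split() returns a named list of data frames, one per level of the grouping column
  pieces <- split(data, data[[by]], drop = TRUE)
  lapply(pieces, function(d) lm(formula, data = d, ...))
}

fits <- penguins |> lm_by("island", bill_length_mm ~ flipper_length_mm)

# One model summary per island, returned as a named list
lapply(fits, summary)

# Coefficients side by side, one column per island
sapply(fits, coef)
```

This keeps the list handling inside the helper, so the calling code stays a single pipe step, at the cost of maintaining a small function of your own.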

mduvekot · April 16, 2025, 2:16pm

Does this not work for you?

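# bill_len and flipper_len are the column names in the penguins data set that ships with R >= 4.5;
# with {palmerpenguins}, use bill_length_mm and flipper_length_mm instead
models <- penguins |> 
  summarise(
    model = list(lm(formula = bill_len ~ flipper_len)),
    .by = island
  )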

kinnersleyn · April 16, 2025, 4:48pm

Thanks for the suggestion and, yes, it works, but according to Mike Mahoney's post (mentioned in my original) "...future function calls no longer have access to your raw data frame", and when I look at the documentation for the .by argument, that appears to be by design. That's why Mike discusses the use of map() to access the elements of the model... so we're back to needing to learn purrr.

Perhaps a better example would have been the following, which shows how moving from the overall/pooled analysis becomes a bit cumbersome when you want by-groups:

# overall analysis has elegant/concise syntax with pipes and use of broom::tidy
models_pooled_tidy <- penguins |> 
  lm(formula = bill_length_mm ~ flipper_length_mm) |> 
  tidy()

# in following code for by-groups, I've needed to add a mutate statement and use lists and specify the data=data syntax

models_by_island_tidy <- penguins |> 
  nest_by(island) |>
  mutate(
    mod = list(lm(formula = bill_length_mm ~ flipper_length_mm,
                  data=data)),
    res = list(tidy(mod)))

It's certainly doable, but I'm just curious whether a simpler syntax is available (like the ease of adding facet_ to ggplot code).

mduvekot · April 16, 2025, 9:28pm

I'm probably missing something, but what about something like:

penguins %>% 
  summarise(
    .by = island,
    model = list(lm(formula = bill_len ~ flipper_len))
  ) %>% 
  rowwise() %>% 
  mutate(res = broom::glance(model))

kinnersleyn · April 17, 2025, 11:49am

Many thanks for continuing to help me. Adding rowwise() is a perfect solution when using glance() because it nicely preserves the column for island within the "rectangular" structure of the glance output, so thank you.
However, when I replaced glance() with tidy() I got an error because res needed to be size 1. I fixed it by wrapping tidy() in list() and it worked, but (obviously) res then becomes nested below island, so a bit more processing is required to get the "tidied" results in the same data frame as island.
Here's the code I used:

models_dotby_island_tidy <- penguins |> 
  summarise(
    .by = island,
    model = list(lm(formula = bill_length_mm ~ flipper_length_mm))
  ) |> 
  rowwise() |> 
  # mutate(res = glance(model)) # for glance don't need a list
  mutate(res = list(tidy(model))) # for tidy need a list

models_dotby_island_tidy
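
As a possible way around the extra processing mentioned above, here is a minimal sketch that uses dplyr::reframe() in place of summarise() plus rowwise(). This is an alternative rather than the approach used in this thread. It assumes dplyr >= 1.1.0 with {broom} and {palmerpenguins} installed, and the name coefs_by_island is only illustrative:

```r
library(dplyr)           # reframe() and .by need dplyr >= 1.1.0
library(broom)
library(palmerpenguins)

# reframe() allows several rows per group, and an unnamed data frame result
# is spread into columns, so tidy()'s output sits next to `island`.
coefs_by_island <- penguins |>
  reframe(
    tidy(lm(bill_length_mm ~ flipper_length_mm)),
    .by = island
  )

coefs_by_island
```

The result is a flat tibble with one row per island and model term, so no list column or unnesting step is involved. The trade-off is that the fitted model objects themselves are not kept.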

In summary, thanks for suggesting the mix of .by and rowwise(), which is certainly an improvement for me (and helps me minimise the use of purrr).
If anyone else has ideas about if/how to get something as succinct as facet_-style syntax when modelling data, it would be great to hear them.

wasd · April 17, 2025, 1:32pm

I'm sorry if I'm missing anything, but starting with your "without subgrouping" code:

penguins |> 
  lm(bill_length_mm ~ flipper_length_mm, data = _) |> 
  summary()

Don't you just need to do this?

split(penguins, penguins$island) |>
  map(function(df) {
    lm(bill_length_mm ~ flipper_length_mm, data = df) |> 
      summary()
  })

Giving you a named list (the names being the corresponding islands). I prefer calling split(), because the dplyr alternative doesn't provide names:

penguins |>
  group_by(island) |>
  group_map(function(df, ...) {
    lm(bill_length_mm ~ flipper_length_mm, data = df) |> 
      summary()
  })
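
If the missing names are the only drawback, a minimal sketch of one workaround is to take them from group_keys(), which lists the groups in the same order that group_map() processes them. The object names grouped and fits are only illustrative, and {dplyr} and {palmerpenguins} are assumed to be installed:

```r
library(dplyr)
library(palmerpenguins)

grouped <- penguins |> group_by(island)

# group_map() calls the function once per group and returns an unnamed list
fits <- grouped |>
  group_map(function(df, key) {
    lm(bill_length_mm ~ flipper_length_mm, data = df)
  })

# group_keys() has one row per group, in the same order as the list above
names(fits) <- as.character(group_keys(grouped)$island)

lapply(fits, summary)
```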

Overall, though, I much prefer the approach where things that start as a tibble remain in a tibble as long as possible:

penguins |>
  nest(nested = -island) |>
  rowwise() |>
  mutate(
    # wrap in list(): each rowwise result must be size 1, and an lm object is not
    model = list(lm(bill_length_mm ~ flipper_length_mm, data = nested)),
    summarized = list(summary(model))
  ) |>
  ungroup() # Remember to undo rowwise()

In the latter example, it's then also easy to

jakasspeech6 · May 6, 2025, 10:27pm

It’s great that it’s working!