Using stat_ instead of dplyr to summarize groups in a ggplot?

Here there,

I would like to create a usual ggplot2 with 2 variables x, y and a grouping factor z.
On top of the plot I would like a mean and an interval for each grouping level (so for both x and y).
I can do that using dplyr and multiple call to ggplot functions (see below) but I find the code rather ugly and I wonder if one cannot use cleverly one of the stat_ functions (or something else) to get the same thing in 3 lines of code without having to do the dplyr steps by hand.


iris %>%
  group_by(Species) %>%
  summarize(mean_Petal.Length = mean(Petal.Length),
            mean_Sepal.Length = mean(Sepal.Length),
            se_Petal.Length = sd(Petal.Length)/sqrt(length(Petal.Length)),
            se_Sepal.Length = sd(Sepal.Length)/sqrt(length(Sepal.Length))) %>%
  ggplot(aes(y = mean_Petal.Length, x = mean_Sepal.Length, colour = Species)) +
    geom_point() +
    geom_errorbarh(aes(xmax = mean_Sepal.Length + 2*se_Sepal.Length,
                       xmin = mean_Sepal.Length - 2*se_Sepal.Length),
                   height = 0.1) +
    geom_errorbar(aes(ymax = mean_Petal.Length + 2*se_Petal.Length,
                      ymin = mean_Petal.Length - 2*se_Petal.Length),
                  width = 0.05) +
    geom_point(data = iris, aes(y = Petal.Length, x = Sepal.Length, colour = Species),
               shape = 1) +
  theme_classic() +
  labs(y = "Petal Length", x = "Sepal Length")

Many thanks,


PS: hopefully the solution would also ensure that vertical and horizontal flat arrow heads have same length without having to make the code even uglier.

1 Like

To generate the mean points and error bars, we need to be able to use the mean x and y locations of each Species as the x and y aesthetics, respectively. To do that, we need to add those values to the data frame (which I've done below with left_join). Then we can use stat_summary and stat_summaryh (from the ggstance package). In the example below, I've also used summarise_at to shorten the code that generates the means.


# Join means by Species to the original data frame and pipe to ggplot
          iris %>%
            group_by(Species) %>%
            summarise_at(vars(matches("Length")), funs(mean=mean))
          ) %>% 
  ggplot(aes(colour=Species)) +
    # Plot raw data
    geom_point(aes(x=Sepal.Length, y=Petal.Length), shape=1) +
    # Plot mean by Species
    geom_point(aes(x=Sepal.Length_mean, y=Petal.Length_mean), shape=18, size=3) +
    # Add vertical errorbars 
    stat_summary(, fun.args=list(mult=1.96), 
                 aes(x=Sepal.Length_mean, y=Petal.Length),
                 geom="errorbar", width=0.05) +
    # Add horizontal errorbars
    stat_summaryh(, fun.args=list(mult=1.96), 
                  aes(x=Sepal.Length, y=Petal.Length_mean),
                  geom="errorbarh", width=0.1) +


Also, for future reference, your original code can be shortened by calculating the means and standard errors with summarise_at:

iris %>%
    group_by(Species) %>%
    summarise_at(vars(matches("Length")), funs(mean, se=sd(.)/sqrt(n())))

Here's a function that attempts to automate the creation of the plot above. It may look a bit confusing if you haven't studied non-standard evaluation in the tidyverse (I still find it more than a bit confusing myself):


fnc = function(data, group, x, y, adj=1) {

  x = enquo(x)
  y = enquo(y)
  # Set size of whisker end caps
  wv = data %>% pull(!!y) %>% range %>% diff/100*adj
  wh = data %>% pull(!!x) %>% range %>% diff/100*adj
  # If grouping variable is numeric, turn it into a factor
  if(data %>% pull(!!group) %>% is.numeric) {
    data = data %>% 
      mutate(!!quo_name(group) := factor(!!group))

  # Generate column names for the x and y means that we'll calculate below
  xmean = paste0(quo_text(x), "_mean")
  ymean = paste0(quo_text(y), "_mean")
            data %>%
              group_by(!!group) %>%
              summarise_at(vars(!!x, !!y), funs(mean=mean))
           ) %>% 
    ggplot(aes(colour=!!group)) +
      geom_point(aes(x=!!x, y=!!y), shape=1) +
      geom_point(aes(x=!!sym(xmean), y=!!sym(ymean)), shape=18, size=3) +
      stat_summary(, fun.args=list(mult=1.96),
                   aes(x=!!sym(xmean), y=!!y),
                   geom="errorbar", width=wh) +
      stat_summaryh(, fun.args=list(mult=1.96),
                    aes(x=!!x, y=!!sym(ymean)),
                    geom="errorbarh", width=wv) +

Now let's try out the function:

  fnc(iris, Species, Petal.Length, Sepal.Length),
  fnc(starwars %>% filter(mass < 500, grepl("Human|Droid|Wookiee|Yoda", species)), 
      species, height, mass),
  fnc(mtcars, cyl, mpg, hp),
  fnc(mtcars, cyl, mpg, hp, adj=3),


Very nice @joels. This a very useful function, and it even works with faceting as well:

fnc(iris, Species, Petal.Length, Sepal.Length) + facet_wrap(~ Species)

It would be great to have this as part of ggplot2 (or an other package), so we could just write something like

ggplot(iris, aes(x=Petal.Length, y=Sepal.Length)) + geom_point() + stat_errorbars(group = Species)

I like to idea of tp2750 to input this as a new function but I think it should be a geom_ one.
@joels, would you like to enter it as a feature request or to integrate it directly in ggplot2, or shall I do it?
I think that it could be useful for a bunch of users.

I think it should be a geom_ one

I agree @courtiol , but geom_errorbar() is already taken, and I think geom_errorbars() is too close to that. Perhaps geom_spread()?

On the other hand, the functionality is related to stat_summary(), and it is clearly a statistics, so stat_errorbars() would not be unreasonable.

I had a quick look at the source of stat_summary(), and it looks like it would not be too difficult to implement.

I think you should add the feature request, as it was your idea.


1 Like

Now submitted as feature request:

This turns out to be quite easy, once you bite the bullet, and look into the ggplot2 extension mechanism.

SE <- function(x) sqrt(var(x)/length(x))

StatErr <- ggproto("StatErr", Stat,
           compute_group = function(data, scales, na.rm = FALSE, spread = "sd", mult = 1) {
    x <- na.omit(data$x)
    y <- na.omit(data$y)
    xMean <- mean(x)
    yMean = mean(y)
    if(spread == "sd") {
      xDelta <- mult * sd(x)
      yDelta <- mult * sd(y)
    } else if(spread == "se") {
      xDelta = mult * SE(x)
      yDelta = mult * SE(y)
    } else {
      stop("stat_err only knows spreads: 'sd', 'se'")
      data.frame(x=xMean, y=yMean, ymax=yMean + yDelta, ymin = yMean - yDelta, xmax = xMean + xDelta, xmin = xMean - xDelta)

stat_err <- function(mapping = NULL, data = NULL, geom = "errorbar",
                        position = "identity", na.rm = FALSE, show.legend = NA, 
                        inherit.aes = TRUE, ..., spread = "sd", mult = 1) {
    data = data,
    mapping = mapping,
    stat = StatErr,
    geom = geom,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      spread = spread,
      mult = mult,

## Then we can do like this:

ggplot(data =  iris, aes(x= Petal.Length, y=Petal.Width, group=Species)) + geom_point(aes(color=Species)) + stat_err(width=.1, spread = "se", mult = 2) + stat_err(geom="point") + stat_err(geom="errorbarh", height=.1, spread = "se", mult = 2)

The result is:

As @hadley pointed out when closing the feature request, we will need a new geom to be able to get both vertical and horizontal errorbars with a single call to a stat.

It looks like @thomasp85 has recently done some work that might make this easier: Removing direction constraint from geoms (#3506) · tidyverse/ggplot2@10fa001 · GitHub


Thanks @tp2750, the link to the PR from @thomasp85 is particularly useful and announce a major upcoming change in ggplot2 that I had failed to notice! This should indeed sort the present situation and many other similar ones! Great job @thomasp85!!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.