Plot an average line over the scatter plot within each cell of facet_grid


Say I have population data on four cities (a, b, c and d) over four years (years 1, 2, 3 and 4). The population data is broken down into two age groups (age1 and age2). The cities also belong to two regions (region1 and region2).

I have created a scatter plot showing how the cities' populations have changed over time, broken down by region and age band using facet_grid.

The data and code used to produce it are as follows:

set.seed(1)  # arbitrary seed so the sampled populations are reproducible
df <- data.frame(city = rep(letters[seq(1,4)], each = 8),
                 region = as.factor(rep(c('region1','region2'), each = 16)),
                 year = as.integer(rep(rep(seq(1,4), each = 2), 4)),
                 age = as.factor(rep(c('age1','age2'), 16)),
                 population = sample(1:20, 32, replace = TRUE))

library(ggplot2)

ggplot(data = df, aes(x = year, y = population, color = city)) + 
  geom_point(size = 4) + 
  facet_grid(age ~ region)

What I am trying to do is overlay each scatter plot with a line showing the changing average population of the cities. So, something like a geom_line() where the value for each year is the average of the two cities' populations in each of the four panels.

I hope that makes sense, but please let me know if I can clarify.

Thank you!

You can summarize the data by panel and add point and line layers for the summarized data. In the code below, I've put the summarized data into the main ggplot call to avoid having to run the code twice (once for each geom that uses the data) and moved the original data frame into the first call to geom_point. The addition of group=city in the first call to geom_point is to avoid a .group not found error.

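library(dplyr)

# Main data: mean population across cities for each region/age/year
ggplot(df %>% group_by(region, age, year) %>% 
         summarise(population = mean(population)), 
       aes(x = year, y = population)) + 
  geom_point(data=df, aes(color=city, group=city), size=4) + 
  facet_grid(age~region) + 
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_line(colour="blue", linetype="11", linewidth=0.3) + 
  geom_point(shape=4, colour="blue", size=3)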


Evidently my brain isn't firing on all cylinders today. You can do this more easily by using stat_summary. In the code below, group=1 overrides the by-city grouping so that we get a mean across all cities within each year, rather than the mean of each individual city within each year (which is the same as the individual observations in this case).

# ggplot2 >= 3.3.0 uses fun; older versions use fun.y instead
ggplot(df, aes(x = year, y = population, colour=city)) + 
  geom_point(size=4) + 
  facet_grid(age~region) + 
  stat_summary(fun=mean, aes(group=1), geom="line", colour="blue") +
  stat_summary(fun=mean, aes(group=1), geom="point", colour="blue", size=3, shape=4)
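
If you want to check what the blue line represents, you can compute the same per-panel yearly means directly with base R's aggregate(). This is only a sketch: it rebuilds the data frame from the question with an arbitrary seed, and panel_means is a name introduced here for illustration.

```r
# Rebuild the question's data (the seed is an arbitrary choice for reproducibility)
set.seed(1)
df <- data.frame(city = rep(letters[1:4], each = 8),
                 region = as.factor(rep(c('region1', 'region2'), each = 16)),
                 year = as.integer(rep(rep(1:4, each = 2), 4)),
                 age = as.factor(rep(c('age1', 'age2'), 16)),
                 population = sample(1:20, 32, replace = TRUE))

# panel_means (hypothetical name): one row per region/age/year combination,
# holding the mean population of the two cities in that panel and year
panel_means <- aggregate(population ~ region + age + year, data = df, FUN = mean)

nrow(panel_means)  # 2 regions x 2 age groups x 4 years
head(panel_means)
```

Each row of panel_means should correspond to one blue cross in the plot.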

In reference to your comment below, even though we were able to use stat_summary here, for more complex cases, keep in mind that you can always add layers with transformations of the data or layers that use completely different data. You just need to make sure that you set the aesthetics within each layer to use the appropriate data columns.
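
As a rough sketch of that more general pattern, assuming the data frame from the question: the averages are computed up front into a separate data frame (avg, a name introduced here for illustration) and handed to their own layer through the data argument. Setting inherit.aes = FALSE stops that layer from picking up colour = city from the main ggplot call, since avg has no city column.

```r
library(ggplot2)

# Rebuild the question's data (the seed is an arbitrary choice for reproducibility)
set.seed(1)
df <- data.frame(city = rep(letters[1:4], each = 8),
                 region = as.factor(rep(c('region1', 'region2'), each = 16)),
                 year = as.integer(rep(rep(1:4, each = 2), 4)),
                 age = as.factor(rep(c('age1', 'age2'), 16)),
                 population = sample(1:20, 32, replace = TRUE))

# avg (hypothetical name): a completely separate data frame of per-panel yearly means
avg <- aggregate(population ~ region + age + year, data = df, FUN = mean)

p <- ggplot(df, aes(x = year, y = population, colour = city)) +
  geom_point(size = 4) +
  facet_grid(age ~ region) +
  # This layer uses avg instead of df and declares its own aesthetics
  geom_line(data = avg, aes(x = year, y = population),
            inherit.aes = FALSE, colour = "blue")

p
```

Because avg keeps the region and age columns, facet_grid can still place each averaged line in the right panel.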


@joels Brilliant, thank you very much! I didn't realise you could manipulate the data like that within a single plot