ggplot() makes residual plots?!

I discovered this when a student did something I thought was a mistake:

ggplot(lm(Sepal.Length~Sepal.Width, data=iris)) + 
  geom_point(aes(x=.fitted, y=.resid))

When did this magic happen, and is there any documentation about how to work with it? Is it just converting the lm object to a data.frame with broom::augment()?

My students make residual plots of everything, so an easy way of doing this with ggplot2 would be great.


From what I can see, ggplot2 identifies the input as being of class lm and then runs the fortify.lm function (source code here), which adds the following columns to the model data:

.hat: Diagonal of the hat matrix

.sigma: Estimate of residual standard deviation when corresponding observation is dropped from model

.cooksd: Cook's distance, cooks.distance

.fitted: Fitted values of model

.resid: Residuals

.stdresid: Standardised residuals

As mentioned here, it is advised to use the broom package instead, which also supports more model types, as fortify may be deprecated in the future.
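For reference, a minimal sketch of the broom route, assuming both ggplot2 and broom are installed. The object names (fit, aug, p) and the dashed zero line are my own additions for illustration, not something from the posts above.

```r
library(ggplot2)
library(broom)

# Same model as in the original question
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# augment() returns the model data plus per-observation columns
# such as .fitted and .resid
aug <- augment(fit)

# Residuals vs fitted values, with a reference line at zero
p <- ggplot(aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")

print(p)
```

The only real difference from passing the lm object straight to ggplot() is that the conversion to a data frame is explicit, so it should carry over to any model class that broom knows how to augment.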


Simon Jackson (@drsimonj on twitter) has a :raised_hands: great post on plotting residuals in R, including with ggplot here


Yeah, I teach my students to use broom on the models and then make the plots with the resulting data.frame. But I've been trying to find some shortcuts because it gets old copying and modifying the 20 or so lines of code needed to replicate what plot.lm() does with 6 characters.

Yes, DRY, so I should make a function, and I have, but it's not working very well.

"From what I can see ..." -- where? is this done inside ggplot()? I looked for a ggplot.lm() method but ...

If you don't mind sharing the function, we could take a look at it and see if we can make it better :blush:

I was a little brief in my last response, so let me go into more detail:

We can see the inner workings of ggplot() by typing ggplot2:::ggplot in the console (or pressing F2 with the cursor on the function) which gives us the following:

function (data = NULL, mapping = aes(), ..., environment = parent.frame()) 
{
    UseMethod("ggplot")
}
<environment: namespace:ggplot2>

UseMethod("ggplot") is telling you that ggplot() is a (S3) generic function that has methods for different object classes. So we can list all the methods of ggplot() with the methods() function.

> methods(ggplot)
[1] ggplot.data.frame* ggplot.default*   
see '?methods' for accessing help and source code

which tells us that there are currently two methods for the ggplot function. UseMethod uses the class of the input to decide which method to call.
In our case the input was the output of an lm call, which has only one class, namely "lm":

class(lm(Sepal.Length ~ Sepal.Width, data = iris))
[1] "lm"

As you correctly noticed, ggplot.lm is not among the available methods, so UseMethod falls back to looking for a default method, ggplot.default, which it finds and calls. If we look at the source code for ggplot.default we get the following:
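The fallback itself is easy to reproduce with a toy generic in base R. Everything named describe below is hypothetical and only there to illustrate the dispatch rule; it has nothing to do with ggplot2's own code.

```r
# Hypothetical generic with two methods, mirroring the ggplot() situation:
# one method for data frames and one default
describe <- function(x, ...) UseMethod("describe")
describe.data.frame <- function(x, ...) "data.frame method"
describe.default <- function(x, ...) "default method"

fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)

describe(iris)  # class "data.frame" has its own method
describe(fit)   # there is no describe.lm, so describe.default is called
```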

> ggplot2:::ggplot.default
function (data = NULL, mapping = aes(), ..., environment = parent.frame()) 
{
    ggplot.data.frame(fortify(data, ...), mapping, environment = environment)
}
<environment: namespace:ggplot2>

where we can see that the data is fed into the fortify function which itself is an S3 generic function

function (model, data, ...) 
UseMethod("fortify")
<environment: namespace:ggplot2>

with the following methods

 [1] fortify.cld*                     
 [2] fortify.confint.glht*            
 [3] fortify.data.frame*              
 [4] fortify.default*                 
 [5] fortify.function*                
 [6] fortify.glht*                    
 [7] fortify.Line*                    
 [8] fortify.Lines*                   
 [9] fortify.lm*                      
[10] fortify.map*                     
[11] fortify.NULL*                    
[12] fortify.Polygon*                 
[13] fortify.Polygons*                
[14] fortify.SpatialLinesDataFrame*   
[15] fortify.SpatialPolygons*         
[16] fortify.SpatialPolygonsDataFrame*
[17] fortify.summary.glht*            
see '?methods' for accessing help and source code

Here we find fortify.lm, which I referred to in my last response; for completeness, here is its source again:

function (model, data = model$model, ...) 
{
    infl <- stats::influence(model, do.coef = FALSE)
    data$.hat <- infl$hat
    data$.sigma <- infl$sigma
    data$.cooksd <- stats::cooks.distance(model, infl)
    data$.fitted <- stats::predict(model)
    data$.resid <- stats::resid(model)
    data$.stdresid <- stats::rstandard(model, infl)
    data
}
<environment: namespace:ggplot2>

This extracts the necessary information and returns it as a data frame, which ggplot can then plot because it now has the correct structure. Hope this was helpful :blush:
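If you want to see what that data frame looks like without going through ggplot2 at all, the same steps can be run by hand with base R. This is only a sketch: fortify_lm_by_hand is a hypothetical name of my own, and the body simply repeats the stats calls from the source shown above.

```r
# Hypothetical stand-in that repeats the steps of fortify.lm using base R only
fortify_lm_by_hand <- function(model, data = model$model) {
  infl <- stats::influence(model, do.coef = FALSE)
  data$.hat <- infl$hat
  data$.sigma <- infl$sigma
  data$.cooksd <- stats::cooks.distance(model, infl)
  data$.fitted <- stats::predict(model)
  data$.resid <- stats::resid(model)
  data$.stdresid <- stats::rstandard(model, infl)
  data
}

fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
by_hand <- fortify_lm_by_hand(fit)

# The model frame columns plus the six dotted columns
names(by_hand)

# Sanity check: fitted value plus residual gives back the response
all.equal(unname(by_hand$.fitted + by_hand$.resid), iris$Sepal.Length)
```

The .fitted and .resid columns are the ones the original geom_point(aes(x=.fitted, y=.resid)) call picks up.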


Thanks! Yes, I always forget to check the default method when I don't see the specific one.

My function:

The main thing I don't like is that it doesn't work with mgcv::gam objects, and it should probably do different things for glm objects.
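On the glm point, one possible direction is to make the helper itself an S3 generic, so that glm objects get their own method. The sketch below is an assumption about what "different things" might mean, not a fix for the function above: resid_frame and its methods are hypothetical names, the mtcars model is only there as a small example, and deviance residuals on the link scale are just one reasonable choice.

```r
# Hypothetical generic: returns fitted values and residuals as a data frame
resid_frame <- function(model, ...) UseMethod("resid_frame")

# Ordinary linear models: raw residuals against fitted values
resid_frame.lm <- function(model, ...) {
  data.frame(.fitted = fitted(model), .resid = residuals(model))
}

# GLMs: linear predictor and a chosen residual type (deviance by default)
resid_frame.glm <- function(model, type = "deviance", ...) {
  data.frame(
    .fitted = predict(model, type = "link"),
    .resid = residuals(model, type = type)
  )
}

lm_fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
glm_fit <- glm(am ~ wt, data = mtcars, family = binomial)

class(glm_fit)              # "glm" is listed before "lm"
head(resid_frame(lm_fit))   # uses resid_frame.lm
head(resid_frame(glm_fit))  # uses resid_frame.glm
```

Because a glm object has class c("glm", "lm"), anything that only defines an lm method (fortify.lm included) will also be used for glm fits, which is probably why the results look off there. I have not tried this with mgcv::gam objects.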

the geom_segment idea is a really cool one!