Pearson residuals in linear regression


I'm a beginner in R and I'm looking for a way to identify and remove Pearson residuals that lie more than 3 SD from their mean in a linear regression.

I executed the linear regression:
mod <- lm(y~x)

And then identified the residuals:

But I can't find a way to identify and remove those that lie more than 3 SD from their mean. How can I standardize these residuals and remove some?

Thank you!

Below is an example of how this could be done. I would advise you to be very cautious about filtering a data set based on this!

set.seed(1) # make the code reproducible
DF <- data.frame(X = 1:100, Y = 1:100 * 2.3 + 6.97 + rnorm(100))
DF[23, 2] <- 0 # put in a deviant value

mod <- lm(Y ~ X, data = DF)
Resid <- resid(mod)
mean(Resid) # This should be very close to zero!
#> [1] 7.701955e-17

range(Resid) # smallest and largest residual
#> [1] -58.847794   2.984806

sdResid <- sd(Resid) # ~ 6.039

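Outliers <- which(abs(Resid) > 3 * sdResid) # positions of the deviant residuals

# Logical indexing stays correct even when nothing exceeds the limit;
# Resid[-Outliers] would return an empty vector if Outliers were empty
ResidFiltered <- Resid[abs(Resid) <= 3 * sdResid]
range(ResidFiltered)
#> [1] -1.685001  2.984806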

Created on 2021-01-26 by the reprex package (v0.2.1)
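
For an unweighted lm() fit, the Pearson residuals are the same as the raw residuals, so the code above already works on them. As a minimal sketch of the standardizing step asked about in the question, assuming the same DF and model as in the example, the residuals can be divided by their standard deviation. rstandard() is an alternative that also adjusts for the leverage of each point. The names PearsonResid, Z, Zstd and Flagged are only illustrative.

```r
set.seed(1)
DF <- data.frame(X = 1:100, Y = 1:100 * 2.3 + 6.97 + rnorm(100))
DF[23, 2] <- 0
mod <- lm(Y ~ X, data = DF)

# Pearson residuals; for an unweighted lm these equal resid(mod)
PearsonResid <- residuals(mod, type = "pearson")

# Standardize by hand: subtract the mean, divide by the standard deviation
Z <- (PearsonResid - mean(PearsonResid)) / sd(PearsonResid)

# Alternative: internally studentized residuals, which adjust for leverage
Zstd <- rstandard(mod)

# Positions whose standardized residual is more than 3 units from zero
Flagged <- which(abs(Z) > 3)
Flagged
```

With the invented data, only the altered row 23 should be flagged, which matches the single value removed in the example above.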

Thank you for your answer!

I have trouble understanding the meaning of the numbers on lines 2 and 3.
R is really new for me, so I'm still trying to figure out the basics! :wink:

Thanks again,

If you mean the lines above, I simply invented some data to work with. The details are not important but I will explain them briefly. I made a data frame with a column named X with values that run from 1 to 100. The column named Y is calculated as

Y = 2.3 * X + 6.97

Instead of writing X, I wrote 1:100 again, which is how I defined X. To prevent the relationship being perfectly linear, leaving no residuals in the model, I added some Gaussian noise using the rnorm() function. Writing rnorm(100) produces 100 values randomly drawn from a normal distribution that has a mean of zero and a standard deviation of one. It is equivalent to writing rnorm(100, mean = 0, sd = 1). All of that together produces a Y column that is linearly related to the X column but has some noise around the linear relationship. In the usual case, you get your data from some measurements and do not have to bother to invent a data frame.
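
A quick way to check the rnorm() equivalence described above, as a small sketch: resetting the seed before each call makes the two calls draw the same numbers. The names a, b, X and Y are just for illustration.

```r
set.seed(1)
a <- rnorm(100)                    # defaults: mean = 0, sd = 1
set.seed(1)
b <- rnorm(100, mean = 0, sd = 1)  # same call with the defaults spelled out
identical(a, b)
#> [1] TRUE

# Build Y from X explicitly; this matches the 1:100 * 2.3 + 6.97 form
X <- 1:100
Y <- 2.3 * X + 6.97 + a
```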

The line

DF[23, 2] <- 0

substitutes a zero for the number in the 23rd row, 2nd column, that is, the 23rd value of Y. I wanted to put in one value that would be very far from the linearly derived values so it would fall outside of the 3-sigma limits of the residuals.
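
To see why a zero is so deviant at that position, the sketch below compares it with the value the noise-free line would give at X = 23. It rebuilds the example data frame so that it runs on its own; Expected is just an illustrative name.

```r
set.seed(1)
DF <- data.frame(X = 1:100, Y = 1:100 * 2.3 + 6.97 + rnorm(100))

DF[23, 2] <- 0                # row 23, column 2
DF$Y[23]                      # the same cell, addressed by column name
#> [1] 0

Expected <- 2.3 * 23 + 6.97   # value on the noise-free line at X = 23
Expected
#> [1] 59.87
```

The zero sits about 60 units below the line, while the added noise has a standard deviation of one, so its residual ends up far outside the 3-sigma limits.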

Great, it's really clear now. Since I'm working on an existing data set, I'll only use the code from line 4 onward.
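
When working from an existing data set, it is usually the rows of the data frame, not just the residuals, that need filtering so the model can be refit. A minimal sketch follows, using the built-in cars data as a stand-in for the real data. The names MyData, Keep, MyDataFiltered and modFiltered are hypothetical. na.action = na.exclude is used so that resid() returns one value per row even if some rows have missing values.

```r
MyData <- cars  # stand-in for your own data frame; columns speed and dist

mod <- lm(dist ~ speed, data = MyData, na.action = na.exclude)
Resid <- resid(mod)                  # one residual per row, NA where data are missing
sdResid <- sd(Resid, na.rm = TRUE)

# Keep rows whose residual is within 3 SD; rows with NA residuals are dropped
Keep <- !is.na(Resid) & abs(Resid) <= 3 * sdResid
MyDataFiltered <- MyData[Keep, ]

# Refit on the filtered rows
modFiltered <- lm(dist ~ speed, data = MyDataFiltered)
nrow(MyData) - nrow(MyDataFiltered)  # number of rows removed
```

The caution from the answer still applies: a row with a large residual is not necessarily a bad measurement, so it is worth inspecting the flagged rows before dropping them.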

Thank you very much for your help,
