Please advise on what I can do about outliers before creating a linear regression model

I want to fit a linear model, so I visualized the trade variable with a box plot to look for outliers. The plot shows several outliers, visible as individual points above the upper whisker.

The box plot shows a right-skewed (positively skewed) distribution for the trade variable, as evidenced by:
• The median line (thick horizontal line inside the box) is closer to the bottom of the box.
• The upper whisker is longer than the lower whisker.
• The presence of high outliers.
• Central Tendency: The median appears to be around 50 units on the trade scale.
• Spread: The interquartile range (represented by the height of the box) spans from approximately 35 to 70 units on the trade scale.
• Range: Excluding outliers, the data ranges from about 10 (lower whisker) to 100 (upper whisker) on the trade scale.
• Outlier Magnitude: Some outliers extend beyond 120 on the trade scale, which is significantly higher than the upper quartile.

> coefficient <- cor.test(d$gdp, d$trade)
> coefficient$estimate
      cor
0.2357552
> plot <- d %>%
+   ggplot(aes(y = trade)) +
+   geom_boxplot()

Usually, you should not do anything to points that are outside the whiskers of a box plot. Unless you have a specific reason to exclude a data point, you should keep it. The whiskers do not mark the limits of possible data; they mark the likely extent of a small, normally distributed data set. If the data are strongly skewed, or if the data set is very large, points beyond the whiskers are to be expected.
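To illustrate the last point, here is a small simulation sketch. The data are made up (they are not your `d`), and `frac_beyond_whiskers` is just a helper name I chose. It counts how many perfectly valid draws land beyond the default whiskers (1.5 times the IQR past the quartiles) for a large normal sample and a large right-skewed sample.

```r
# Illustrative simulation with made-up data, not the poster's data set.
set.seed(42)

# Fraction of points that boxplot() would draw as individual points,
# i.e. more than 1.5 * IQR beyond the hinges (the default rule).
frac_beyond_whiskers <- function(x) {
  length(boxplot.stats(x)$out) / length(x)
}

large_normal <- rnorm(100000)    # symmetric, normally distributed
large_skewed <- rlnorm(100000)   # right-skewed (log-normal)

frac_normal <- frac_beyond_whiskers(large_normal)
frac_skewed <- frac_beyond_whiskers(large_skewed)

# For a normal distribution the expected fraction is about 0.7 %,
# which is still hundreds of points in a sample this large.
frac_normal
# The skewed sample puts a much larger share beyond the upper whisker.
frac_skewed
```

None of these points is an error. They are simply what the distribution produces, which is why the whiskers alone are not a reason to drop anything.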

An adequate reason to exclude a data point is that it is physically impossible or very unlikely. For example, if the height of a person is recorded as a negative number or as 5 meters, it would be reasonable to exclude the value. Another example would be measuring a voltage, then checking the meter calibration and finding that the meter is not working properly.
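As a minimal sketch of the height example, assuming a hypothetical data frame `heights` with made-up values and a plausible range that I picked by hand, the exclusion could look like this:

```r
# Hypothetical height measurements in metres; two entries are recording errors.
heights <- data.frame(
  id       = 1:6,
  height_m = c(1.62, 1.75, -1.70, 1.81, 5.00, 1.68)
)

# Plausible range chosen from domain knowledge (an assumption, adjust as needed).
plausible <- heights$height_m > 0.4 & heights$height_m < 2.8

excluded <- heights[!plausible, ]  # keep a record of what was dropped and why
cleaned  <- heights[plausible, ]   # rows used for the analysis

excluded
```

The rule comes from what is physically possible, not from where the whiskers happen to fall.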


I don't see the issue with RStudio, but anyway ...

Note that in (Pearson) correlation analysis, which is not exactly the same thing as linear regression, the significance test assumes that both variables follow a normal distribution, which does not seem to be the case here. I'm also not sure that we can talk about outliers in correlation analysis, as @FJCC said.
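One way to check the normality assumption is a Shapiro-Wilk test, optionally backed by a normal Q-Q plot. The sketch below uses a simulated right-skewed variable, `trade_sim`, as a stand-in, since I don't have your data; with your data you would pass `d$trade` and `d$gdp` instead.

```r
# Simulated right-skewed variable standing in for `trade` (made-up data).
set.seed(7)
trade_sim <- rlnorm(200, meanlog = 4, sdlog = 1)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal.
sw <- shapiro.test(trade_sim)
sw$p.value  # a very small p-value is evidence against normality

# A visual check would be:
# qqnorm(trade_sim); qqline(trade_sim)
```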

However, in linear regression analysis, for instance with lm(), the fit residuals should follow a normal distribution and, in that case, residuals that are very unlikely (let's say 1 % or even less) could be flagged as outliers if the fitted model and the weighting of the observations are trustworthy.
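Here is a rough sketch of that idea on simulated data. The variable names, the model `trade ~ gdp`, the two injected errors and the 1 % cut-off are all assumptions for illustration, not something taken from your data. It uses studentized residuals from `rstudent()` and flags those beyond the two-sided 1 % normal quantile.

```r
# Simulated data with two deliberately injected gross errors (made-up values).
set.seed(1)
n     <- 200
gdp   <- rlnorm(n, meanlog = 3, sdlog = 0.5)
trade <- 20 + 0.8 * gdp + rnorm(n, sd = 5)
trade[c(10, 50)] <- trade[c(10, 50)] + 60
sim <- data.frame(gdp, trade)

# Fit the model first; outliers are judged on the residuals, not on raw values.
fit <- lm(trade ~ gdp, data = sim)

# Studentized residuals: each residual scaled by an error estimate
# computed without that observation.
r <- rstudent(fit)

# Two-sided 1 % cut-off under the normal approximation (about 2.58).
cutoff  <- qnorm(1 - 0.01 / 2)
flagged <- which(abs(r) > cutoff)

sim[flagged, ]  # candidates to inspect, not rows to delete automatically
```

Keep in mind that with a 1 % threshold about 1 % of perfectly good observations get flagged by chance, so each flagged row still needs a substantive reason before it is excluded.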