Mutate identifier (outlier: yes or no)

library(dplyr)
set.seed(123)

# Toy data
weight_actual_a = rnorm(50, mean = 10, sd = 4)
weight_actual_b = rnorm(50, mean = 4, sd = 2)
weight_expected_a = rnorm(50, mean = 10, sd = 1)
weight_expected_b = rnorm(50, mean = 4, sd = .2)


df <- tibble(
  product = rep(c("A", "B"), each = 50),
  weight_actual = c(weight_actual_a, weight_actual_b),
  weight_expected = c(weight_expected_a, weight_expected_b)
) %>% 
  # difference between actual and expected weights
  mutate(difference = weight_expected - weight_actual)


head(df)
#> # A tibble: 6 x 4
#>   product weight_actual weight_expected difference
#>   <chr>           <dbl>           <dbl>      <dbl>
#> 1 A                7.76            9.29      1.53 
#> 2 A                9.08           10.3       1.18 
#> 3 A               16.2             9.75     -6.48 
#> 4 A               10.3             9.65     -0.630
#> 5 A               10.5             9.05     -1.47 
#> 6 A               16.9             9.95     -6.91
tail(df)
#> # A tibble: 6 x 4
#>   product weight_actual weight_expected difference
#>   <chr>           <dbl>           <dbl>      <dbl>
#> 1 B                6.72            3.74     -2.98 
#> 2 B                2.80            4.40      1.60 
#> 3 B                8.37            4.12     -4.25 
#> 4 B                7.07            3.75     -3.32 
#> 5 B                3.53            3.88      0.349
#> 6 B                1.95            3.76      1.82

Based on the variable difference, and taking the product group into account, how do I use mutate() to create a variable, say outlier, that takes the value "yes" if the corresponding value is an outlier and "no" otherwise?

How do you want to define what an outlier is? Without that, it is hard to give any specific advice.

@FJCC Thank you for your reply. I understand that there is no rigid mathematical definition of what constitutes an outlier, so I should have been more specific. But I am also not sure which method would be appropriate. For learning's sake, suppose that I am looking for outliers based on Tukey's method.

Just to clarify, I am trying to find out whether weight_actual and weight_expected differ significantly. As a beginner, I am still trying to find a way to tackle this issue. What do you suggest?

The code below shows one way to flag points that lie more than 1.5 * IQR beyond the quartiles (Tukey's method). However, I would be very cautious about eliminating data on the basis of such a test. As the data in your reprex show, differences calculated from normally distributed data can exceed these limits even with only 50 points. In a large normally distributed data set, there will almost always be data beyond these limits (roughly 0.7% of the points on average). And your data may not be normally distributed. My own practice is to keep data unless a definite problem is known (e.g. an instrument is found to be out of calibration) or the data are physically extremely implausible (e.g. a person who weighed 50 kg last month is recorded as weighing 100 kg this month).
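As a rough check of the point about normal data (separate from the flagging code that follows), here is a minimal sketch. The helper name frac_outside and the sample sizes are my own choices for illustration, not anything from the original reprex.

```r
# Sketch: how often do perfectly normal data fall outside Tukey's limits?
set.seed(123)

# Hypothetical helper: share of x lying more than 1.5 * IQR beyond the quartiles
frac_outside <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  mean(x < q[1] - fence | x > q[2] + fence)
}

# Theoretical share for an exact normal distribution (about 0.007)
z_limit <- qnorm(0.75) + 1.5 * (qnorm(0.75) - qnorm(0.25))
theoretical <- 2 * pnorm(-z_limit)

# Empirical share in one large normal sample
large_sample <- frac_outside(rnorm(1e5))

# Proportion of 50-point normal samples with at least one flagged point
small_samples <- replicate(2000, frac_outside(rnorm(50)) > 0)

theoretical
large_sample
mean(small_samples)
```

The last value gives a feel for how often a clean 50-point sample would still show an "outlier" under this rule.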

library(dplyr)
library(ggplot2)
set.seed(123)

# Toy data
weight_actual_a = rnorm(50, mean = 10, sd = 4)
weight_actual_b = rnorm(50, mean = 4, sd = 2)
weight_expected_a = rnorm(50, mean = 10, sd = 1)
weight_expected_b = rnorm(50, mean = 4, sd = .2)

df <- tibble(
  product = rep(c("A", "B"), each = 50),
  weight_actual = c(weight_actual_a, weight_actual_b),
  weight_expected = c(weight_expected_a, weight_expected_b)
) %>% 
  # difference between actual and expected weights
  mutate(difference = weight_expected - weight_actual)

# Per-product Tukey limits: 1.5 * IQR beyond the quartiles
OutlierCalc <- df %>%
  group_by(product) %>%
  summarize(UL = quantile(difference, 0.75) + 1.5 * IQR(difference),
            LL = quantile(difference, 0.25) - 1.5 * IQR(difference))
# Attach each product's limits to its rows
df <- inner_join(df, OutlierCalc, by = "product")

# Flag values that fall outside the limits
df <- df %>% mutate(Outlier = ifelse(difference < LL | difference > UL, "Yes", "No"))

ggplot(df, aes(x = difference, fill = Outlier)) + geom_histogram(color = "white") +
  facet_wrap(~product, nrow = 1)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2022-04-25 by the reprex package (v0.2.1)
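For what it's worth, the same kind of flag can probably also be produced in a single grouped mutate(), without the intermediate summary table and join. This is only a sketch: is_tukey_outlier is a hypothetical helper written for this example (it is not part of dplyr), and the toy data are rebuilt the same way as in the question.

```r
library(dplyr)
set.seed(123)

# Toy data, built the same way as in the question
df <- tibble(
  product = rep(c("A", "B"), each = 50),
  weight_actual = c(rnorm(50, mean = 10, sd = 4), rnorm(50, mean = 4, sd = 2)),
  weight_expected = c(rnorm(50, mean = 10, sd = 1), rnorm(50, mean = 4, sd = 0.2))
) %>%
  mutate(difference = weight_expected - weight_actual)

# Hypothetical helper: TRUE where x lies more than k * IQR beyond the quartiles
is_tukey_outlier <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Limits are computed within each product because of group_by()
flagged <- df %>%
  group_by(product) %>%
  mutate(outlier = if_else(is_tukey_outlier(difference), "yes", "no")) %>%
  ungroup()

# Number of flagged and unflagged rows per product
count(flagged, product, outlier)
```

Keeping the rule in one small function makes it easy to change the multiplier k or to swap in a different definition of an outlier later.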

@FJCC Many thanks for the suggestions and also for teaching me some cool tricks!
