How to eliminate outlier trajectories from a dataset?

Hello everyone,

I would like to find a way to eliminate outliers, meaning values that fall below the 25th percentile or above the 75th percentile of my dataset. The issue is that each row of my dataset represents a trajectory, and I would like to remove not just single values but any whole trajectory that is considered an outlier at one or more points along its duration (columns F11 to F110).

Here is a sample of my data:

dput(donneesCVaf)
structure(list(C = c("w", "w", "w", "w", "w", "l", "l", "l",
"w", "w"), F10 = c(858L, 831L, 614L, 802L, 782L, 472L, 449L,
629L, 560L, 565L), F11 = c(864L, 825L, 615L, 750L, 738L, 446L,
454L, 510L, 565L, 567L), F12 = c(872L, 812L, 618L, 654L, 680L,
430L, 453L, 474L, 556L, 558L), F13 = c(898L, 772L, 621L, 563L,
642L, 428L, 457L, 472L, 561L, 544L), F14 = c(853L, 718L, 621L,
529L, 625L, 438L, 452L, 481L, 558L, 531L), F15 = c(691L, 677L,
617L, 515L, 626L, 482L, 465L, 491L, 543L, 519L), F16 = c(642L,
642L, 615L, 533L, 576L, 506L, 494L, 503L, 569L, 512L), F17 = c(639L,
619L, 615L, 566L, 611L, 511L, 515L, 512L, 549L, 512L), F18 = c(630L,
603L, 614L, 605L, 627L, 507L, 562L, 576L, 582L, 517L), F19 = c(630L,
590L, 617L, 640L, 630L, 514L, 622L, 610L, 580L, 527L), F110 = c(645L,
579L, 624L, 630L, 606L, 562L, 648L, 673L, 597L, 540L)), row.names = c(NA,
10L), class = "data.frame")

I would appreciate any help! If you have any questions or if I did not express myself clearly, please do not hesitate to ask.

I am interpreting this as applying the 25/75 test to the entire data frame rather than row by row. The approach is to move from the initial data frame to its two subsets, outliers and keepers, through a series of "what" questions: what must be done to bring the initial object closer to the desired object? Answering that comes down to choosing which function to apply at each step.

# data
d <- data.frame(
  C    = c("w", "w", "w", "w", "w", "l", "l", "l", "w", "w"),
  F10  = c(858, 831, 614, 802, 782, 472, 449, 629, 560, 565),
  F11  = c(864, 825, 615, 750, 738, 446, 454, 510, 565, 567),
  F12  = c(872, 812, 618, 654, 680, 430, 453, 474, 556, 558),
  F13  = c(898, 772, 621, 563, 642, 428, 457, 472, 561, 544),
  F14  = c(853, 718, 621, 529, 625, 438, 452, 481, 558, 531),
  F15  = c(691, 677, 617, 515, 626, 482, 465, 491, 543, 519),
  F16  = c(642, 642, 615, 533, 576, 506, 494, 503, 569, 512),
  F17  = c(639, 619, 615, 566, 611, 511, 515, 512, 549, 512),
  F18  = c(630, 603, 614, 605, 627, 507, 562, 576, 582, 517),
  F19  = c(630, 590, 617, 640, 630, 514, 622, 610, 580, 527),
  F110 = c(645, 579, 624, 630, 606, 562, 648, 673, 597, 540)
)

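# functions
# for row x, flag each value at or outside the data-frame-wide quantiles in q
df_spot_outliers <- function(x) unlist(d[x,2:12]) <= q[1] | unlist(d[x,2:12]) >= q[2]
# helpers for a row-by-row variant; not called below, and r (presumably a
# table of per-row quantiles) is never created in this reprex
my_quantile <- function(x) quantile(unlist(d[x,2:12]), probs = the_probs)
spot_outliers <- function(x) d[x,2:12] <= r[x,1] | d[x,2:12] >= r[x,2]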

# main

the_probs <- c(0.25,0.75)

# assuming outliers are calculated on a data frame basis
# note: d[2:11] spans F10 to F19 only, while df_spot_outliers tests 2:12 (F10 to F110)

(q <- quantile(unlist(d[2:11]),probs = the_probs))
#>   25%   75% 
#> 513.5 630.0

m <- matrix(nrow = 10, ncol = 11)
for(i in 1:10) m[i,] <- df_spot_outliers(i)
# show quantiles used
q
#>   25%   75% 
#> 513.5 630.0
(outliers <- d[which(rowMeans(m) != 0),])
#>    C F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F110
#> 1  w 858 864 872 898 853 691 642 639 630 630  645
#> 2  w 831 825 812 772 718 677 642 619 603 590  579
#> 4  w 802 750 654 563 529 515 533 566 605 640  630
#> 5  w 782 738 680 642 625 626 576 611 627 630  606
#> 6  l 472 446 430 428 438 482 506 511 507 514  562
#> 7  l 449 454 453 457 452 465 494 515 562 622  648
#> 8  l 629 510 474 472 481 491 503 512 576 610  673
#> 10 w 565 567 558 544 531 519 512 512 517 527  540
q
#>   25%   75% 
#> 513.5 630.0
(keepers <- d[which(rowMeans(m) == 0),])
#>   C F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F110
#> 3 w 614 615 618 621 621 617 615 615 614 617  624
#> 9 w 560 565 556 561 558 543 569 549 582 580  597

Created on 2023-03-18 with reprex v2.0.2
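
For what it's worth, the same whole-data-frame filter can be written without the loop. This is only a sketch under a couple of assumptions: it computes the quartiles over all eleven numeric columns (F10 to F110), whereas the reprex above takes them from d[2:11], and the names vals, flagged and has_outlier are mine, introduced for illustration.

```r
# Same sample data as in the reprex, rebuilt so this block runs on its own
d <- data.frame(
  C    = c("w", "w", "w", "w", "w", "l", "l", "l", "w", "w"),
  F10  = c(858, 831, 614, 802, 782, 472, 449, 629, 560, 565),
  F11  = c(864, 825, 615, 750, 738, 446, 454, 510, 565, 567),
  F12  = c(872, 812, 618, 654, 680, 430, 453, 474, 556, 558),
  F13  = c(898, 772, 621, 563, 642, 428, 457, 472, 561, 544),
  F14  = c(853, 718, 621, 529, 625, 438, 452, 481, 558, 531),
  F15  = c(691, 677, 617, 515, 626, 482, 465, 491, 543, 519),
  F16  = c(642, 642, 615, 533, 576, 506, 494, 503, 569, 512),
  F17  = c(639, 619, 615, 566, 611, 511, 515, 512, 549, 512),
  F18  = c(630, 603, 614, 605, 627, 507, 562, 576, 582, 517),
  F19  = c(630, 590, 617, 640, 630, 514, 622, 610, 580, 527),
  F110 = c(645, 579, 624, 630, 606, 562, 648, 673, 597, 540)
)

# Assumption: every numeric column (everything except C) takes part in the test
vals <- as.matrix(d[, -1])

# Quartiles over all values pooled together (whole-data-frame basis)
q <- unname(quantile(vals, probs = c(0.25, 0.75)))

# TRUE wherever a value is at or outside the quartiles
flagged <- vals <= q[1] | vals >= q[2]

# A trajectory (row) is dropped if any of its points is flagged
has_outlier <- rowSums(flagged) > 0

outliers <- d[has_outlier, ]
keepers  <- d[!has_outlier, ]

rownames(keepers)
#> [1] "3" "9"
```

On this sample the result matches the reprex above: rows 3 and 9 are kept. To test only F11 to F110, as in the original question, d[, 3:12] could be used in place of d[, -1].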

Thank you very much, that helped me a lot!
