---
title: "Area Under the Curve Speed Test"
---

# Benchmarking Area Under the Curve

## Background

I had a very similar problem only last month. I created this **R Markdown** notebook to track and test the answers from a recent discussion on StackOverflow, where I asked:

> I want to estimate the likelihood that a randomly selected item from one group will have a higher score than a randomly selected item from a different group; that is, the *Probability of Superiority*, sometimes called the *Common-language Effect Size*. See, for example: https://rpsychologist.com/d3/cohend/. This can be solved algebraically if we accept that the data are normally distributed (McGraw & Wong, 1992, *Psychological Bulletin*, 111, doi: 10.1037/0033-2909.111.2.361), but I know that my data are not normally distributed, making such an estimate unreliable.
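For reference, the normal-theory estimate mentioned in the question takes only a few lines of base R. This is a sketch of the McGraw & Wong formula; `cles_normal` is my own name for it, not a function used in the benchmarks below:

```r
# Common-language effect size under normality (McGraw & Wong, 1992):
# P(A > B) = pnorm((mean_A - mean_B) / sqrt(var_A + var_B))
cles_normal <- function(a, b) {
  pnorm((mean(a) - mean(b)) / sqrt(var(a) + var(b)))
}
```

With two identical groups this returns exactly 0.5, as it should; the estimate is only trustworthy when both groups really are normal.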

The resulting discussion produced several alternatives, most of which were a great deal better than my first efforts.

This notebook runs benchmarking tests to compare the relative speed of each algorithm across a range of sample sizes.

# Setup

### Load required libraries

```
# for analysis
library(data.table)
library(bigstatsr)
# for plotting
library(ggplot2) # pretty plots
library(scales) # commas in large numbers
# for recording & comparing runtimes
library(microbenchmark)
# for output to PDF & HTML
library(tinytex)
options(width = 90)
```

## Functions to test

Eight alternative functions are benchmarked against each other, each run 100 times at a prespecified sample size. Summary tables report absolute and relative times, and the results are then presented as violin plots.

### The functions

- My original For-If-Else loop
- `expand.grid()`: bring the pairs together in a data frame
- `CJ` (Cross Join): a data.table formed from the cross product of the two vectors
- Cartesian product of data.tables
- `wilcox.test()` from base R
- Modified Wilcox: the Wilcoxon statistic computed directly from ranks
- Base R AUC, taking advantage of properties of the matrix outer product
- AUC from **bigstatsr**
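As a concrete check of the definition, the quantity every function below estimates can be computed by brute force on a tiny example (my own illustration, not one of the benchmarked functions):

```r
# two items per group: the pairs are (2,1) (2,4) (4,1) (4,4)
a <- c(2, 4)
b <- c(1, 4)
# wins count 1, ties count 0.5: (1 + 0 + 1 + 0.5) / 4
ps <- mean(outer(a, b, ">") + 0.5 * outer(a, b, "=="))
ps # 0.625
```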

#### My Original For-If-Else function

```
MyForIfElse.function <- function() {
  bigger <- 0
  equal <- 0
  for (i in alpha) {
    for (j in beta) {
      if (i > j) {
        bigger <- bigger + 1
      } else if (i == j) {
        equal <- equal + 0.5
      }
    }
  }
  PaGTb <- (bigger + equal) / nPairs
  PaGTb
}
```

#### Expand.grid function

```
Myexpandgrid.function <- function() {
  # My second attempt ("pairs" rather than "c", to avoid masking base::c)
  pairs <- expand.grid(a = alpha, b = beta)
  aGTb <- length(which(pairs$a > pairs$b))
  aEQb <- length(which(pairs$a == pairs$b))
  aGTb <- aGTb + (0.5 * aEQb)
  PaGTb <- aGTb / nPairs
  PaGTb
}
```

#### Cross Join function

```
MyCJ.function <- function() {
  data.table::CJ(a = alpha, b = beta, sorted = FALSE)[, {
    aGTb <- sum(a > b)
    aEQb <- sum(a == b)
    aGTb <- aGTb + 0.5 * aEQb
    PaGTb <- aGTb / nPairs # data.table returns the last value in {}
  }]
}
```

#### bigstatsr AUC function

```
MyAUC.function <- function() {
  # label the alpha observations 1 and the beta observations 0
  bigstatsr::AUC(c(alpha, beta), rep(1:0, c(length(alpha), length(beta))))
}
```
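The second argument to `bigstatsr::AUC()` is a vector of 0/1 class labels. A quick sketch of how the `rep()` call builds it (my own illustration, with tiny group sizes):

```r
# rep(1:0, c(2, 3)): the value 1 repeated twice (labels for a group
# of 2 alpha values), then 0 repeated three times (labels for a group
# of 3 beta values), matching the ordering of c(alpha, beta)
labels <- rep(1:0, c(2, 3))
labels # 1 1 0 0 0
```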

#### DataTable function

```
MyDT.function <- function() {
  # frequency tables of each group
  DTa <- data.table(a = alpha)[, .N, keyby = a]
  DTb <- data.table(b = beta)[, .N, keyby = b]
  # non-equi join counts pairs with b < a; equi join counts ties
  (DTb[DTa, on = .(b < a), allow.cartesian = TRUE,
       nomatch = 0L, sum(N * i.N)] +
     0.5 * DTb[DTa, on = .(b = a), allow.cartesian = TRUE,
               nomatch = 0L, sum(N * i.N)]) / nPairs
}
```
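To see why the two joins add up to the pair count, here is the same idea on a tiny example (my own sketch; the data values are arbitrary):

```r
library(data.table)

# frequency tables for two tiny groups
DTa <- data.table(a = c(2, 4))[, .N, keyby = a]
DTb <- data.table(b = c(1, 4))[, .N, keyby = b]
# non-equi join: pairs where b < a (two of them: 2 > 1 and 4 > 1)
bigger <- DTb[DTa, on = .(b < a), allow.cartesian = TRUE,
              nomatch = 0L, sum(N * i.N)]
# equi join: tied pairs (one of them: 4 == 4)
ties <- DTb[DTa, on = .(b = a), allow.cartesian = TRUE,
            nomatch = 0L, sum(N * i.N)]
(bigger + 0.5 * ties) / 4 # 0.625, matching the brute-force count
```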

#### Wilcoxon test function

```
MyWilcox.function <- function() {
  wilcox.test(alpha, beta)$statistic / nPairs
}
```

#### Modified Wilcox

```
wilcox_AUC <- function() {
  r <- rank(c(alpha, beta))
  n.x <- length(alpha)
  n.y <- length(beta)
  (sum(r[seq_along(alpha)]) - n.x * (n.x + 1) / 2) / n.x / n.y
}
```
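The shortcut works because of the rank-sum identity $U = R_A - n_x(n_x+1)/2$, where $R_A$ is the sum of group A's ranks in the pooled sample, and AUC $= U / (n_x n_y)$. A quick self-contained check against the brute-force pair count (my own, with arbitrary data):

```r
a0 <- c(1, 3, 5)
b0 <- c(2, 3, 4)
# brute force: wins plus half-ties over all 9 pairs
brute <- mean(outer(a0, b0, ">") + 0.5 * outer(a0, b0, "=="))
# rank-sum shortcut, the same formula as wilcox_AUC() above
r <- rank(c(a0, b0))
nx <- length(a0)
ny <- length(b0)
shortcut <- (sum(r[seq_along(a0)]) - nx * (nx + 1) / 2) / nx / ny
c(brute, shortcut) # both 0.5
```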

#### Base AUC

```
AUC2 <- function() {
  # tabulate() does not count values <= 0, so shift everything into
  # positive integer bins; note that tabulate() truncates non-integer
  # values, so with continuous data ties are counted per integer bin
  m <- min(c(alpha, beta))
  A_Freq <- tabulate(alpha - m + 1)
  B_Freq <- tabulate(beta - m + 1)
  # outer product of the two frequency tables
  out_matrix <- outer(A_Freq, B_Freq)
  # cells below the diagonal have a > b; diagonal cells are ties
  bigger <- sum(out_matrix[lower.tri(out_matrix)])
  equal <- 0.5 * sum(diag(out_matrix))
  (bigger + equal) / length(alpha) / length(beta)
}
```

## Benchmark all of the Functions

```
MyBenchmark.function <- function(x) {
  mbm <<- microbenchmark(
    "ForIfElse" = MyForIfElse.function(),
    "expandgrid" = Myexpandgrid.function(),
    "CrossJoin" = MyCJ.function(),
    "CartesianDT" = MyDT.function(),
    "Wilcox" = MyWilcox.function(),
    "BaseWilcox" = wilcox_AUC(),
    "BigStatsAUC" = MyAUC.function(),
    "BaseAUC" = AUC2(),
    times = x,
    unit = "ns"
  )
}
```

# Charting functions

## Function to visualise the data distributions

```
MyDistPlot <- function() {
  a <- data.frame(val = alpha, id = "alpha")
  b <- data.frame(val = beta, id = "beta")
  both <- rbind(a, b)
  p2 <- ggplot(data = both, aes(x = val, group = id, fill = id)) +
    geom_density(adjust = 1.5, alpha = .2) +
    theme_classic() +
    labs(title = NULL,
         subtitle = paste("Sample sizes =", MyRows),
         caption = paste("AUC = Alpha:", round(PaGTb, 2) * 100,
                         ", Beta:", round(1 - PaGTb, 2) * 100,
                         "\nNote truncated x-axis: values can range ± 30")) +
    theme(legend.position = c(0.9, 0.9)) +
    xlim(-8, 8)
  p2
}
```

## Function to plot results of simulation

```
MyPlot.function <- function() {
  ggplot(mbm, aes(expr, time)) +
    geom_violin(scale = "width", trim = TRUE,
                fill = "blue", alpha = 0.1,
                colour = "#999999") +
    geom_jitter(height = 0, width = 0.2,
                alpha = 0.1,
                colour = "navy") +
    theme_classic() +
    scale_y_log10() +
    labs(title = "Speed test: alternative Probability of Superiority calculations",
         subtitle = paste(comma(MyRows), "x",
                          comma(MyRows), "=",
                          comma(nPairs), "comparisons"),
         y = "Nanoseconds (log10)", x = NULL) +
    coord_flip()
}
```

# Generate some artificial data

## Generating data

In this example, two non-normal data sets are compared, so it is not obvious in advance which group should have the higher probability of superiority. Group Alpha is heavy-tailed, with many large negative and positive values, while group Beta follows a shifted $\chi^2$ distribution (df = 1), so it is highly skewed.

```
# for reproducibility
set.seed(42)
# standard normal distribution, cubed: symmetric but heavy-tailed
MakeAlpha <- function() {rnorm(MyRows, mean = 0, sd = 1)^3}
# standard normal squared (chi-squared, df = 1), shifted down by 0.6
MakeBeta <- function() {rnorm(MyRows, mean = 0, sd = 1)^2 - 0.6}
```

# Run the simulation

#### Change the number in `MyRows` for different sample sizes

Note: large sample sizes can take a long time to run.

```
MyRows <- 25
```

The complete benchmark is run 100 times.

*(Reduce this number for a faster result)*

```
alpha <- MakeAlpha()
beta <- MakeBeta()
nPairs <- length(alpha) * length(beta)
# For interest, we calculate the sample Probability of Superiority
# (Area Under the Curve)
PaGTb <- AUC2()
MyDistPlot()
```

## Print summary results and visualise benchmarks

```
MyBenchmark.function(100)
# print comparison speed test table
print(mbm, unit = "ms")
print(mbm, unit = "relative")
# Visualise the comparison
MyPlot.function()
```

### Interpretation of small sample experiment

At this small sample size, the original ForIfElse algorithm works about as well as, or better than, the other options.

The Modified Wilcox and Base R AUC algorithms are the fastest most often.

# Rerun with larger sample size

```
MyRows <- 1000
set.seed(42)
alpha <- MakeAlpha()
beta <- MakeBeta()
nPairs <- length(alpha) * length(beta)
PaGTb <- MyAUC.function()
MyDistPlot()
MyBenchmark.function(100)
# print comparison benchmark tables
print(mbm, unit = "ms")
print(mbm, unit = "relative")
MyPlot.function()
```

### Interpretation of larger sample experiment

The Modified Wilcox, bigstatsr AUC and Base R AUC algorithms are the fastest most often.

The ForIfElse algorithm is now among the poorer performers.

# Rerun with much larger sample size

#### This can take some time ... You may want to go make a cup of tea while it's running

```
MyRows <- 5000
set.seed(42)
alpha <- MakeAlpha()
beta <- MakeBeta()
nPairs <- length(alpha) * length(beta)
PaGTb <- MyAUC.function()
MyDistPlot()
MyBenchmark.function(100)
# print comparison benchmark tables
print(mbm, unit = "ms")
print(mbm, unit = "relative")
MyPlot.function()
```

### Interpretation of the much larger sample experiment

Base R AUC, followed by Modified Wilcox and bigstatsr AUC, are the fastest most often.

Base R AUC runs at roughly 5,000 times the speed of the ForIfElse algorithm.

# Conclusions

The *Area Under the Curve (AUC)* function from the **bigstatsr** package *(Statistical Tools for Filebacked Big Matrices)* by **Florian Privé** (Grenoble, France) was extraordinarily efficient.

The two close winners were created by **Cole**, who prefers to remain anonymous, with significant input from **chinsoon12** (Singapore). These were:

- a Wilcoxon statistic extracted from the base R `wilcox.test()` source, stripped down to this single task rather than a general statistical routine.
- an AUC function that relies on `tabulate()` and the matrix outer product.

Both algorithms are attractive because they rely only on **Base R**.