I have been struggling with a bio computer-lab teaching assignment on Posit.cloud that worked fine in past years but is now taking much longer to run than expected. The slowdown seems to be in generalized linear models (glm()) with multiple predictors. On my laptop, fitting a model that uses 2 of the variables (y ~ x) versus all 10 (y ~ x + z + ...) takes about 2x as long, but on Posit.cloud it takes about 15x as long (and in the benchmark run pasted below, the relative slowdown is closer to 50x). Any insight into why this is happening? It is not a memory limitation; I have 16 GB on both my own computer and on Posit.cloud. Here is the test code I have been using:
library(MASS)
## generate random dataset
# --- Configuration for the 10-variable dataset ---
n_obs <- 1500 # Number of observations (rows)
n_vars <- 10 # Number of variables (columns)
# 1. Define the mean vector (mu)
# We can create a vector of means, for example, 10, 20, 30, ... 100
mu <- seq(10, 100, by = 10)
print(paste("Mean vector:", paste(mu, collapse=", ")))
# 2. Define the covariance matrix (Sigma)
# Creating a 10x10 manual matrix is tedious and error-prone.
# A simple way to create a valid, symmetric, positive-definite matrix
# is to use the outer product of some random numbers and add variance to the diagonal.
# Generate a random correlation structure
set.seed(123) # for reproducibility
rand_matrix <- matrix(rnorm(n_vars^2, mean = 0, sd = 0.5), nrow = n_vars)
# Create a symmetric matrix
Sigma_base <- t(rand_matrix) %*% rand_matrix
# Add variance to the diagonal to ensure positive-definiteness
diag(Sigma_base) <- diag(Sigma_base) + 1.5
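# (Optional check, not in my original script) Sigma_base should now be
# positive definite, i.e. all eigenvalues > 0:
stopifnot(all(eigen(Sigma_base, symmetric = TRUE, only.values = TRUE)$values > 0))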
## Random dataset
dataset <- mvrnorm(n = n_obs, mu = mu, Sigma = Sigma_base)
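## (Aside, not used in the timings below) Converting to a data frame with named
## columns would make the formulas easier to read, e.g. glm(V1 ~ V2 + V3, data = df);
## I kept the matrix-column syntax so the benchmarks match what I actually ran.
# df <- as.data.frame(dataset)
# names(df) <- paste0("V", seq_len(n_vars))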
## benchmarking package
#install.packages("rbenchmark")
library(rbenchmark)
## Compare glm with 2-10 variables
# Define the code blocks to benchmark
benchmark(
  glm(dataset[,1] ~ dataset[,2]),
  glm(dataset[,1] ~ dataset[,2] + dataset[,3] + dataset[,4] + dataset[,5] +
        dataset[,6] + dataset[,7] + dataset[,8] + dataset[,9] + dataset[,10]),
  replications = 1000, # Number of times to repeat each expression
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self")
)
benchmark(
  glm(dataset[,1] ~ dataset[,2]),
  glm(dataset[,1] ~ dataset[,2] + dataset[,3]),
  glm(dataset[,1] ~ dataset[,2] + dataset[,3] + dataset[,4]),
  glm(dataset[,1] ~ dataset[,2] + dataset[,3] + dataset[,4] + dataset[,5]),
  glm(dataset[,1] ~ dataset[,2] + dataset[,3] + dataset[,4] + dataset[,5] + dataset[,6]),
  glm(dataset[,1] ~ dataset[,2] + dataset[,3] + dataset[,4] + dataset[,5] + dataset[,6] + dataset[,7]),
  glm(dataset[,1] ~ dataset[,2] + dataset[,3] + dataset[,4] + dataset[,5] + dataset[,6] + dataset[,7] + dataset[,8]),
  glm(dataset[,1] ~ dataset[,2] + dataset[,3] + dataset[,4] + dataset[,5] + dataset[,6] + dataset[,7] + dataset[,8] + dataset[,9]),
  glm(dataset[,1] ~ dataset[,2] + dataset[,3] + dataset[,4] + dataset[,5] + dataset[,6] + dataset[,7] + dataset[,8] + dataset[,9] + dataset[,10]),
  replications = 100, # Number of times to repeat each expression
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self")
)
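## (Aside, untimed) The nine formulas above can also be built in a loop instead of
## being typed out by hand; this should fit the same sequence of models:
fits <- lapply(2:n_vars, function(k) {
  f <- as.formula(paste("dataset[,1] ~",
                        paste0("dataset[,", 2:k, "]", collapse = " + ")))
  glm(f)
})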
Benchmark results on an M2 Mac with 16 GB (this run used replications = 1000; times are in seconds, and the row numbers correspond to the nine models above, 1 = fewest predictors):
replications elapsed relative user.self sys.self
9 1000 1.452 2.388 1.358 0.091
8 1000 1.332 2.191 1.252 0.081
7 1000 1.203 1.979 1.130 0.074
6 1000 1.108 1.822 1.032 0.077
5 1000 1.003 1.650 0.927 0.076
4 1000 0.910 1.497 0.841 0.069
3 1000 0.789 1.298 0.726 0.063
2 1000 0.708 1.164 0.645 0.064
1 1000 0.608 1.000 0.554 0.055
Benchmark on Posit.cloud with 4 cores and 16 GB (replications = 100; same layout):
replications elapsed relative user.self sys.self
9 100 6.898 48.922 1.207 5.711
8 100 7.275 51.596 1.341 5.949
7 100 7.465 52.943 1.174 6.339
6 100 7.801 55.326 1.081 6.774
5 100 0.191 1.355 0.163 0.025
4 100 0.177 1.255 0.167 0.009
3 100 0.156 1.106 0.147 0.010
2 100 0.147 1.043 0.136 0.010
1 100 0.141 1.000 0.117 0.020
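In case it helps with diagnosing this, I can collect environment details from both machines. As far as I know, sessionInfo() reports the BLAS/LAPACK libraries in use on recent R versions, so I would run something like:
sessionInfo()            # includes BLAS/LAPACK library paths (R >= 3.4)
La_version()             # LAPACK version
parallel::detectCores()  # logical cores visible to R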