I have noticed a performance difference in the base R function unique() between R 3.4.4 and 3.5.0. Under 3.5.0, unique() on a data frame runs much more slowly when many of the columns are factors, and faster when they are not; under 3.4.4 the two cases take about the same time. Below is an example R script (saved as microbench-unique.R) that should be reproducible in the rocker/tidyverse Docker containers:
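# microbenchmark is not preinstalled in the rocker/tidyverse image
suppressMessages(install.packages('microbenchmark', repos='https://cran.rstudio.com/', quiet=TRUE))
suppressPackageStartupMessages({
  library(dplyr)
  library(microbenchmark)
})
# the R version is passed on the command line, only to label the output
args <- commandArgs(trailingOnly=TRUE)
rver <- args[1]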
# I want a dataset (data.frame) with repeated rows, with or without factorized columns
get_dat <- function(dataset, multiplier=3, factorize=FALSE) {
  dat <- as_tibble(dataset)
  if (factorize) {
    dat <- mutate_if(dat, function(x) is.character(x) || is.integer(x), as.factor)
  }
  n_dat <- nrow(dat)
  n <- multiplier * n_dat
  dat <- sample_n(dat, n, replace=TRUE)
  dat <- as.data.frame(dat)
  dat
}
# this is where the unique() operation is tested
do_microbenchmark <- function(dataset, multiplier, factorize, msg) {
  dat <- get_dat(dataset, multiplier, factorize)
  mbdat <- microbenchmark(unique(dat), unit="ms", times=2000L)
  cat(paste0(msg, ':\n'))
  print(mbdat)
}
cat(paste0('\nUsing R version ', rver, '\n======================\n'))
do_microbenchmark(starwars, 5, TRUE, 'starwars dataset, converted to factors')
do_microbenchmark(starwars, 5, FALSE, 'starwars dataset, not converted to factors')
Running it under both versions, I get:
$ for rver in 3.4.4 3.5.0; do docker run --rm -v $(pwd):/scratch -w /scratch rocker/tidyverse:$rver Rscript microbench-unique.R $rver; done
##
##Using R version 3.4.4
##======================
##starwars dataset, converted to factors:
##Unit: milliseconds
## expr min lq mean median uq max neval
## unique(dat) 5.4102 5.7713 6.144108 6.00405 6.3163 45.8087 2000
##starwars dataset, not converted to factors:
##Unit: milliseconds
## expr min lq mean median uq max neval
## unique(dat) 5.983 6.44155 6.781395 6.70885 6.9872 11.0138 2000
##
##Using R version 3.5.0
##======================
##starwars dataset, converted to factors:
##Unit: milliseconds
## expr min lq mean median uq max neval
## unique(dat) 16.282 28.0554 36.48758 35.8414 43.9043 111.0781 2000
##starwars dataset, not converted to factors:
##Unit: milliseconds
## expr min lq mean median uq max neval
## unique(dat) 2.0359 2.4058 2.706818 2.54465 2.7514 16.8791 2000
I know about distinct() from dplyr and other alternatives (e.g. data.table's unique() method), but what I'm interested in is what changed in the base R unique() implementation. For data frames, unique() is built on duplicated(), and I saw that there was a bug fix to duplicated()/unique() that may have arrived in R 3.5.0, but I don't know whether the slowdown I'm seeing is related to that or to something else.
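To make the unique()/duplicated() relationship concrete, here is a minimal sketch. The data frame df and the variable names are made up for illustration and are not part of the benchmark above. It only checks that, for a data frame, unique() keeps exactly the rows that duplicated() does not flag, and then prints the two S3 methods so their source can be compared between R versions. It does not by itself show what causes the slowdown.

```r
# Small made-up data frame with repeated rows and one factor column.
df <- data.frame(
  id    = c(1L, 2L, 2L, 3L, 1L),
  label = factor(c("a", "b", "b", "c", "a")),
  stringsAsFactors = FALSE
)

# duplicated() flags each row that repeats an earlier row.
dup <- duplicated(df)

# unique() on a data frame should match dropping the flagged rows by hand.
via_unique     <- unique(df)
via_duplicated <- df[!dup, , drop = FALSE]
same <- identical(via_unique, via_duplicated)
print(dup)
print(same)

# Print the data.frame methods as implemented in the running R version,
# so the output can be diffed between 3.4.4 and 3.5.0.
print(getS3method("unique", "data.frame"))
print(getS3method("duplicated", "data.frame"))
```

Running this in each container and comparing the printed method bodies would be one way to see whether the implementation of duplicated() for data frames differs between the two versions.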
Does anyone know anything about this? Thanks,
Andy