Hello,
I'm creating a matrix that holds a value for each combination of customers (rows) and features (columns). Let's call this matrix NBA. Based on data received through an API, this matrix needs to be updated many times each second, with each call inserting a new value via m[x,y] <- new_value. Subsequently, some matrix operations are carried out (not important here).
The matrix is a private field of an R6 object, and an update_matrix method allows updating a single cell of the matrix. However, this operation is very slow compared to updating a plain matrix object outside R6: on the order of milliseconds instead of nanoseconds.
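For context, the layout described above looks roughly like the following minimal sketch. The class name NBA_holder, the field name mm, and the get_matrix accessor are placeholders chosen for illustration; only update_matrix is the actual method name.

```r
library(R6)

# Hypothetical class: the matrix lives in a private field and is
# changed one cell at a time through update_matrix().
NBA_holder <- R6Class("NBA_holder",
  public = list(
    initialize = function(input_matrix) {
      private$mm <- input_matrix
    },
    # Overwrite a single cell of the private matrix
    update_matrix = function(row, col, value) {
      private$mm[row, col] <- value
      invisible(self)
    },
    # Accessor added only so the result can be inspected (assumed)
    get_matrix = function() private$mm
  ),
  private = list(
    mm = NULL
  )
)

# Tiny stand-in matrix instead of the real customers x features data
holder <- NBA_holder$new(matrix(0, nrow = 5, ncol = 3))
holder$update_matrix(row = 2, col = 3, value = 1)
holder$get_matrix()[2, 3]
```

The reprex below uses a public field instead, to keep it as short as possible.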
Reprex:
library(bench)
library(ggplot2)
library(tidyr)
# Create NBA matrix, sparse
no_customers <- 1e6
no_features <- 30
NBA_matrix <-
  matrix(
    sample(c(rep(0, 1000), 1), size = no_customers * no_features, replace = TRUE),
    nrow = no_customers,
    ncol = no_features
  )
# Create NBA_like R6 object with matrix
library(R6)
NBA_lite <- R6Class("NBA_lite",
  public = list(
    mm = NULL,
    initialize = function(input_matrix) self$mm <- input_matrix,
    get_matrix = function() self$mm,
    modify = function(row, col, value) self$mm[row, col] <- value
  )
)
new_NBA_lite <- NBA_lite$new(input_matrix = NBA_matrix)
# Benchmark modifying single value, matrix vs R6 field
bench::mark(
  matrix = NBA_matrix[234123, 10] <- 2,
  R6_field = new_NBA_lite$modify(row = 234123, col = 10, value = 2)
)
| expression | median | total_time |
|---|---|---|
| NBA_matrix[234123, 10] <- 2 | 804ns | 8.84ms |
| new_NBA_lite$modify(row = 234123, col = 10, value = 2) | 126ms | 125.66ms |
I would expect some overhead from using R6, but surely not on the order of 125 ms per update. Am I missing something about copy semantics when a (semi-)large matrix is stored as a field?
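One way to check whether copies are involved might be base R's tracemem(), which reports each time a traced object is duplicated. The sketch below is an assumption-laden diagnostic, not a fix: the class name Holder and the small matrix size are placeholders, and tracemem() only works if R was built with memory profiling, hence the capabilities("profmem") guard. As far as I understand copy-on-modify, the global matrix and the field initially refer to the same object, so the first write to either could trigger a full copy.

```r
library(R6)

# Small stand-in matrix (hypothetical size, just to keep the check fast)
m <- matrix(0, nrow = 1000, ncol = 30)

# Hypothetical class mirroring NBA_lite from the reprex
Holder <- R6Class("Holder",
  public = list(
    mm = NULL,
    initialize = function(input_matrix) self$mm <- input_matrix,
    modify = function(row, col, value) {
      self$mm[row, col] <- value
      invisible(self)
    }
  )
)

h <- Holder$new(m)

# tracemem() needs an R build with memory profiling enabled
if (capabilities("profmem")) {
  tracemem(h$mm)       # a "tracemem[...]" line is printed on each duplication
  h$modify(1, 1, 2)    # first write after sharing the object with `m`
  h$modify(2, 1, 2)    # second write; watch whether another copy is reported
  untracemem(h$mm)
} else {
  h$modify(1, 1, 2)
  h$modify(2, 1, 2)
}

# The original matrix is untouched either way (copy-on-modify)
m[1, 1]
h$mm[1, 1]
```

If a tracemem line shows up on every call to modify(), each update is copying the whole matrix; if it shows up only on the first call, the benchmark is mostly measuring that one-off copy.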
Thanks for any suggestions,
JW