Quicker Elementwise Matrix-Multiplications

Hello,

Currently I have really large matrices to which I need to apply simple element-wise matrix multiplications, as below:

C <- A * B

The problem I am having is that each matrix is roughly 120 columns x 350,000 rows, which amounts to about 42 million cells per matrix. Is there a quicker way to accomplish this task than the base `*` operator? Would I be able to gain a lot more speed with C++ or similar?

See bigmemory

Do you have example code for this? I took a quick look into {bigmemory} and in 15 minutes of looking didn't see a way to use it to facilitate element-wise multiplication for matrices. I don't see a big.matrix method for * in the {bigmemory}, {biganalytics}, or {bigtabulate} packages. {bigalgebra} hasn't been updated in about 5 years and has been removed from CRAN.
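For what it's worth, even without a `*` method, element-wise multiplication could be done column-by-column with ordinary subscripting. Here is a minimal base-R sketch of that chunked pattern (plain matrices here so it runs anywhere; the assumption is that with {bigmemory} the same subscripting would work on big.matrix objects, which I have not tested):

```r
# Chunked element-wise multiply: process one column at a time so only a
# single column of each input is in working memory at once. With
# {bigmemory}, a, b, and out would be big.matrix objects instead.
chunked_mult <- function(a, b) {
  out <- matrix(0, nrow = nrow(a), ncol = ncol(a))
  for (j in seq_len(ncol(a))) {
    out[, j] <- a[, j] * b[, j]
  }
  out
}

set.seed(1)
a <- matrix(runif(12), nrow = 4)
b <- matrix(runif(12), nrow = 4)
identical(chunked_mult(a, b), a * b)  # TRUE
```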

1 Like

bigalgebra was removed in January 2020 after its last update in December 2019, so it's uncertain what that portends for the future.

Most of my linear algebra was lost during a move from Columbus to Amherst in 1971, so I'm probably off. Isn't

daxpy(A=1, X, Y)

the same as

X * Y

with the default scaling value of 1 for A and X?

1 Like

Rfast claims to use parallelism...

mat.mult(x, y)
Crossprod(x,y)
Tcrossprod(x,y)
Arguments
x A numerical matrix.
y A numerical matrix.
Details
The functions perform matrix multiplication, cross product and transpose cross product. They are
faster(!) than R's functions for large matrices. Depending on the computer, maybe higher dimensions
are required for the functions to make a difference. The functions run in parallel in C++.
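Note, though, that mat.mult(), Crossprod(), and Tcrossprod() are true matrix products (`%*%`, `t(x) %*% y`, and `x %*% t(y)`), not element-wise multiplication, so they answer a different question. A base-R illustration of the difference (base crossprod() standing in for Rfast::Crossprod()):

```r
x <- matrix(1:4, nrow = 2)
y <- matrix(5:8, nrow = 2)

x * y            # element-wise: each cell multiplied with its counterpart
x %*% y          # matrix product: row-by-column dot products
crossprod(x, y)  # t(x) %*% y, the operation Rfast::Crossprod() parallelises

identical(x * y, x %*% y)  # FALSE: different operations, different results
```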

daxpy() performs a * x + y, not element-wise multiplication.
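A quick base-R check makes the difference concrete; even with the scaling value a = 1, BLAS daxpy(a, x, y) gives a different answer than x * y:

```r
a <- 1
x <- c(2, 3, 4)
y <- c(5, 6, 7)

a * x + y  # what daxpy() computes: 7 9 11
x * y      # element-wise product: 10 18 28
```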

2 Likes

Thanks for the reality therapy

1 Like

How much quicker do you need the element-wise multiplication to be?

On my system it takes on the order of 1/8 of a second to do this operation,

n <- 3.5e5
p <- 120
set.seed(123) 
x <- matrix(runif(n * p), nrow = n)
y <- matrix(runif(n * p), nrow = n)
format(object.size(x), "MB")
#> [1] "320.4 Mb"
library(microbenchmark)
mbr <- microbenchmark(
  R   = x * y
)
mbr
#> Unit: milliseconds
#>  expr      min       lq     mean median       uq     max neval
#>     R 106.9622 108.3356 124.0534 123.71 137.5286 153.561   100

Created on 2020-09-12 by the reprex package (v0.3.0)

How many times are you doing this element-wise matrix multiplication?

Even if the average time for a single multiplication were 1 ns, doing 42 million of these multiplications would take 42 milliseconds, only a three-fold increase in speed over the {base} solution.
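The back-of-envelope arithmetic behind that estimate:

```r
cells <- 3.5e5 * 120   # 42 million element-wise multiplications
cells * 1e-9           # at 1 ns per multiplication: 0.042 s, i.e. 42 ms
0.125 / 0.042          # vs the ~125 ms {base} timing above: roughly 3x
```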

I do have C++ code which is a little faster than {base} at the expense of overwriting one of the input matrices. I'm on mobile right now, so I'll update this post with that code when I get home.

EDIT: Here is the faster C++ code,

library(microbenchmark)
n <- 3.5e5
p <- 120
set.seed(123) 
.x <-  matrix(runif(n * p), nrow = n)
.y <-  matrix(runif(n * p), nrow = n)

This code is adapted from a solution found here:
https://stackoverflow.com/a/19920823
I've removed the creation of a new storage object to facilitate speeding up the function call.

Rcpp::cppFunction("NumericVector multMat(NumericMatrix m1, NumericVector m2) {
  // modified from code found at:
  // https://stackoverflow.com/a/19920823
  m2 = m1 * m2;
  return m2;
}")

(mbr <- microbenchmark(
  R     = x * y,
  RCPP  = multMat(x, y),
  times = 100,
  setup = { x <- .x; y <- .y }
))
#> Unit: milliseconds
#>  expr      min        lq     mean    median        uq      max neval
#>     R 106.8269 108.56660 124.9005 125.45150 138.30920 169.7719   100
#>  RCPP  58.9489  59.62685  60.8673  60.10355  61.40645  73.7025   100

Created on 2020-09-12 by the reprex package (v0.3.0)
A word of warning, the y object will be overwritten by the output object using this code. So it is only useful if you no longer need the original matrix after performing the multiplication. I had to include the setup = argument in the microbenchmark() call to reset the y object to its original state.
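If the original matrix is still needed, one workaround is to pass in an explicit copy: arithmetic like `y * 1` forces R to allocate a fresh matrix. A base-R sketch of the idea (the destructive overwrite is simulated here so it runs without the Rcpp function above):

```r
x <- matrix(1:4, nrow = 2)
y <- matrix(5:8, nrow = 2)

# `y * 1` makes R allocate a new matrix, so the original survives.
# With the Rcpp code above you would call multMat(x, y_work) instead;
# y_work would be clobbered but y left intact.
y_work <- y * 1
y_work[] <- x * y_work   # stand-in for the destructive overwrite

identical(y, matrix(5:8, nrow = 2))  # TRUE: original untouched
```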

1 Like

Okay, because I am a bit depraved I couldn't really let this drop. Here is the best I could do for element-wise matrix multiplication.

Doing the matrix multiplication in a nested OMP for loop in FORTRAN I was able to realize a 6x improvement in speed. This is still destructive to the first matrix passed in, but hey, you wanted speed, and this is the fastest I could do.

It's possible you might be able to get faster results doing the same thing in C/C++, but R and FORTRAN matrices are column-major and one-indexed while C/C++ arrays are row-major and zero-indexed, and I didn't want to deal with that nonsense. I put it together as a tiny package if anyone wants to have a go at improving it.

If you're on Windows you'll need to have Rtools installed. Linux should have everything you need out of the box. On OSX (assuming you're on R 4.0.0 or later) you'll need Apple Xcode 10.1 and GNU Fortran 8.2, if you're on an earlier version of R you're on your own.

devtools::install_github("elmstedt/elmult", force = TRUE)
library(microbenchmark)
library(elmult)
n <- 3.5e5
p <- 120
set.seed(123)
.a <- matrix(runif(n * p), nrow = n)
b <- matrix(runif(n * p), nrow = n)

microbenchmark(
  R       = a * b,
  FORTRAN = em(a, b),
  setup   = a <- .a * 1,
  check   = "identical"
)
#> Unit: milliseconds
#>     expr      min        lq      mean    median       uq      max neval
#>        R 102.9640 104.64525 129.69537 106.65625 140.4574 207.5471   100
#>  FORTRAN  20.0684  20.26365  20.48035  20.36785  20.6380  21.7892   100

Your results will vary, though, as my machine has an Intel 7800X CPU with 6 hyper-threaded cores for 12 threads total. For the OMP, I set it to use 11 threads. Some other number of threads may work better (for reference, 2 threads was getting me about 30 ms and 4 threads about 20 ms, with very little improvement beyond that). The multi-threading is set up to iterate through columns first; this is beneficial since the data is stored column-major, so we are accessing memory sequentially rather than quasi-randomly.

Created on 2020-09-13 by the reprex package (v0.3.0)
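The column-major point can be seen even in plain R. Both loop orders below compute the same result, but the column-first version walks down each column and so touches memory sequentially, while the row-first version strides across columns (timings vary too much by machine to quote here, so only correctness is shown):

```r
n <- 200; p <- 100
set.seed(42)
a <- matrix(runif(n * p), nrow = n)
b <- matrix(runif(n * p), nrow = n)

# Column-first: inner index walks down a column (contiguous in memory).
col_first <- matrix(0, n, p)
for (j in seq_len(p)) {
  for (i in seq_len(n)) {
    col_first[i, j] <- a[i, j] * b[i, j]
  }
}

# Row-first: inner index jumps between columns (strided access).
row_first <- matrix(0, n, p)
for (i in seq_len(n)) {
  for (j in seq_len(p)) {
    row_first[i, j] <- a[i, j] * b[i, j]
  }
}

identical(col_first, row_first)  # TRUE: only the access pattern differs
```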

EDIT: Also, you should remove the {dplyr} tag as nothing in your question or in any of the answers is related to {dplyr}.

2 Likes

Hello @elmstedt,

Thanks for the detailed reply! I also get similar run times on that example data. My actual data is similarly sized, but the run time stretches to a couple of seconds rather than matching your example. I am not entirely sure of the reason yet (might it be because the values are xxxx.yyyy-sized numbers?)

I am quite okay with overwriting at this stage as space is also becoming a bit of a dire concern at the moment. I am going to give your code a run and let you know :slight_smile:

Thanks for the follow-up! I will give this a go as well. I am still using Windows at the moment. Eventually this will sit in the backend of a Shiny app, so I'm not sure what happens there in terms of thread count and so on. I am hopeful it will be faster than this silly laptop.

(I have removed the dplyr tag :slight_smile: )

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.