I'm trying to perform a large matrix cross product and a bunch of other calculations following that, but it's slower than I'd like. I feel like there must be a way to utilise parallel processing to speed the task up, but I don't know if the overhead associated with parallel processing makes this task doomed from the start.
I know there are alternative BLAS libraries to the default one that ships with R that are great for this sort of thing (Intel MKL, OpenBLAS), but I don't really understand how to install and set them up, and I don't even know whether they work on Windows. The tutorials I've seen for setting them up on Windows are old and don't seem to work for me. So I'm trying to use futures and the furrr package in R to get speed improvements, but failing.
Below is a reproducible example of the time difference between a standard and a parallel matrix cross product. The parallel version is actually slower.
#Matrices - a little slow to run the first line
a <- Matrix::rsparsematrix(2000000, 30, 0.5)
b <- Matrix::rsparsematrix(200, 30, 0.8)
#Packages needed
library(tictoc)
library(future)
#Standard call - takes a good few seconds--------------------
tic()
test1 <- Matrix::tcrossprod(a, b)
toc()
#> 35.21 sec elapsed
##Parallel approach------------------------------------------
#Split data up into 4 chunks
nrows <- seq_len(nrow(a))
a_chunked <- split.data.frame(a, cut(nrows, pretty(nrows, 4)))
#Run each chunk in parallel
plan(multisession)
tic()
test2 <- furrr::future_map(a_chunked, ~Matrix::tcrossprod(., b),
                           .options = furrr::furrr_options(seed = NULL))
toc()
#> 49.5 sec elapsed
In reality, I would do subsequent calculations on the matrix after the cross product, and each "worker" would return a small data frame of results rather than a large sparse matrix like in the example above. So the overhead of sending data out to each worker may be high, but the amount of data sent back by each worker wouldn't be.
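To make that pattern concrete, here is a minimal sketch of a per-chunk workflow that sends each worker its own block of rows and gets back only a one-row data frame. The matrix sizes are deliberately tiny stand-ins, and `summarise_chunk` with its `n_rows` and `total` columns is a hypothetical placeholder for the real follow-up calculations, not the actual ones.

```r
library(Matrix)
library(future)
library(furrr)

set.seed(1)
# Small stand-ins for the real matrices (sizes are illustrative only)
a_small <- rsparsematrix(2000, 30, 0.5)
b_small <- rsparsematrix(200, 30, 0.8)

# Assign each row to one of 4 contiguous, equal-width chunks
n_chunks <- 4
chunk_id <- cut(seq_len(nrow(a_small)), breaks = n_chunks, labels = FALSE)
row_idx <- split(seq_len(nrow(a_small)), chunk_id)

# Slice the matrix up front, so each list element is one block of rows
a_chunks <- lapply(row_idx, function(i) a_small[i, , drop = FALSE])

# Hypothetical per-chunk summary: stands in for the "subsequent calculations".
# The large product only exists inside the worker; a one-row data frame comes back.
summarise_chunk <- function(a_chunk, b) {
  prod_chunk <- Matrix::tcrossprod(a_chunk, b)
  data.frame(n_rows = nrow(a_chunk), total = sum(prod_chunk))
}

plan(multisession, workers = 2)
res <- future_map(a_chunks, summarise_chunk, b = b_small)
plan(sequential)

# Combine the small per-chunk results into one data frame
result <- do.call(rbind, res)
```

Whether this beats the single-process call still depends on how the cost of shipping each chunk to a worker compares with the time spent computing on it, so it would need timing on the real data.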
Am I approaching this in the completely wrong way, or is there not much I can really do to get speed improvements here without an alternative BLAS library?