I am trying to work out a conditional sum for each column of a massive sparse matrix, but I am encountering memory issues. Column summing works fine when there are no conditions, but if I try to apply a condition of >= 1 then I run into memory issues, which I believe are driven by attempts to replace values in the sparse matrix when the comparison is evaluated.
I need this to be as fast as possible and to avoid the memory issue... is this possible? In my actual internal R package there are other calculations that take place, and speed is extremely important for the users of the package.
Reproducible example below:
# Matrices - a little slow to run
a <- Matrix::rsparsematrix(2000000, 30, 0.5)
b <- Matrix::rsparsematrix(200, 30, 0.8)
c <- Matrix::tcrossprod(b, a)

# Works
colsums <- Matrix::colSums(c)

# Doesn't work - not enough memory
colsums_1plus <- Matrix::colSums(c >= 1)
#> Error: cannot allocate vector of size 3.0 Gb
I've played around with a few different approaches and so far no luck. I've tried different matrix types, and I've also tried splitting the large matrix into chunks and doing the calculations in parallel using furrr::future_map (roughly the sketch below), but that either hit the same memory error, or some chunks ran while others failed, and overall the time taken was significantly higher!
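For reference, the chunked attempt looked something like this (simplified; chunk_size and the multisession plan are just illustrative, not my exact setup):

# Roughly the chunked attempt (chunk_size is illustrative)
future::plan(future::multisession)
chunk_size <- 100000
chunks <- split(seq_len(ncol(c)), ceiling(seq_len(ncol(c)) / chunk_size))

# Conditional column sum per chunk in parallel, then recombine
colsums_1plus <- unlist(furrr::future_map(chunks, function(idx) {
  Matrix::colSums(c[, idx, drop = FALSE] >= 1)
}))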
Thanks in advance for any help with this.