cuBLAS or cuSOLVER strategy to invert many tiny matrices (>1 million 9x9 matrices)

Problem: solve Ax=v for x with A being a 2d square matrix and v a 1d vector.
Typical size of a matrix: 9x9
Typical number of matrices: >=1e6

Current code using cuBLAS batched getrf and getrs (see the sketch after this list):

  • init:
    • create handle
    • cudaMalloc `double *A` (batchSize*N*N*sizeof(double)) and `double *v` (batchSize*N*sizeof(double))
    • fill `double **Aptr` and `double **vptr` with the addresses of each small 9x9 matrix stored in A and each right-hand side stored in v
  • iterate:
    • setAandV<<<numBlocks, numThreads>>>(A, v)
    • cublasDgetrfBatched with Aptr
    • cublasDgetrsBatched with Aptr and vptr
    • use results stored in v
  • destroy: free memory and destroy handle
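
For reference, here is a minimal sketch of that pipeline (N, batchSize, and the setAandV kernel come from the description above; everything else is my assumption about how the pieces fit together, with error checking omitted). The per-matrix pointer arrays only need to be built once at init time, and cuBLAS expects column-major storage within each 9x9 block:

```cpp
// Hedged sketch of the batched pipeline described above; error checking
// omitted for brevity. Assumes each 9x9 matrix is stored contiguously and
// column-major inside A, as cuBLAS expects.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

void solveBatched(int N, int batchSize)   // N = 9, batchSize >= 1e6
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    double *A, *v;
    cudaMalloc(&A, (size_t)batchSize * N * N * sizeof(double));
    cudaMalloc(&v, (size_t)batchSize * N * sizeof(double));

    // Build the device arrays of per-matrix pointers once, at init time.
    std::vector<double*> hA(batchSize), hV(batchSize);
    for (int i = 0; i < batchSize; ++i) {
        hA[i] = A + (size_t)i * N * N;
        hV[i] = v + (size_t)i * N;
    }
    double **Aptr, **vptr;
    cudaMalloc(&Aptr, batchSize * sizeof(double*));
    cudaMalloc(&vptr, batchSize * sizeof(double*));
    cudaMemcpy(Aptr, hA.data(), batchSize * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(vptr, hV.data(), batchSize * sizeof(double*), cudaMemcpyHostToDevice);

    int *pivots, *infoDev, infoHost;
    cudaMalloc(&pivots, (size_t)batchSize * N * sizeof(int));
    cudaMalloc(&infoDev, batchSize * sizeof(int));

    // Per iteration:
    // setAandV<<<numBlocks, numThreads>>>(A, v);   // user kernel fills A and v

    // LU-factor all matrices in one call, then solve; solutions overwrite v.
    cublasDgetrfBatched(handle, N, Aptr, N, pivots, infoDev, batchSize);
    cublasDgetrsBatched(handle, CUBLAS_OP_N, N, 1, Aptr, N, pivots,
                        vptr, N, &infoHost, batchSize);

    // ... use results in v, then clean up.
    cudaFree(pivots); cudaFree(infoDev);
    cudaFree(Aptr);   cudaFree(vptr);
    cudaFree(A);      cudaFree(v);
    cublasDestroy(handle);
}
```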

It turns out that setting the values in A and v is costly, possibly due to the low bandwidth and high latency of the global memory where they are stored.

  • The getrf step also seems nontrivial in cost, but I guess there is nothing I can do about that.

Any suggestions to improve the strategy?

  • Can we use malloc or new inside a kernel thread and store the resulting addresses in Aptr for use with `cublasDgetrfBatched`/`cublasDgetrsBatched`?
  • Can we iterate over all cells and call cuSOLVER's getrf and getrs for each cell? I feel this would be slow.

It is not clear whether you actually want to invert matrices or whether you need a solver for small dense systems. For most use cases the latter is what is needed, also for numerical reasons.

Assuming the matrices don’t get any larger than 9x9, one potential strategy is to write a self-contained dedicated solver (with partial pivoting if necessary) for each particular matrix size, completely unroll all the loops and convert the matrix elements into scalars. In that scenario, each thread handles one matrix, and it works best if the matrix elements are float, because high register usage might become a performance limiter otherwise.
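
To make the idea concrete, here is a hedged sketch of such a kernel (not njuffa's actual code): a templated Gaussian-elimination solver with a compile-time size N and one thread per matrix, using `#pragma unroll` so the compiler can unroll every loop. One caveat: the run-time pivot index keeps the local arrays from being fully promoted to registers, so complete scalarization as described would require unrolled pivot handling or omitting pivoting where the matrices allow it:

```cpp
// Hypothetical sketch of a per-thread solver for tiny systems. N is a
// compile-time constant so all loops can be unrolled; each thread loads one
// matrix and right-hand side into local arrays, factors, and solves.
// Caveat: the run-time pivot index p prevents full register promotion of
// a[][]; drop pivoting (if your matrices permit) to scalarize completely.
#include <cuda_runtime.h>
#include <math.h>

template <int N>
__global__ void solveTinySystems(const double *A, double *v, int batch)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= batch) return;

    // Copy this thread's matrix (row-major here) and RHS into local storage.
    double a[N][N], b[N];
    #pragma unroll
    for (int i = 0; i < N; ++i) {
        b[i] = v[(size_t)m * N + i];
        #pragma unroll
        for (int j = 0; j < N; ++j)
            a[i][j] = A[(size_t)m * N * N + i * N + j];
    }

    // Gaussian elimination with partial pivoting.
    #pragma unroll
    for (int k = 0; k < N; ++k) {
        int p = k;                             // pivot row
        #pragma unroll
        for (int i = k + 1; i < N; ++i)
            if (fabs(a[i][k]) > fabs(a[p][k])) p = i;
        if (p != k) {                          // swap rows k and p
            #pragma unroll
            for (int j = 0; j < N; ++j) {
                double t = a[k][j]; a[k][j] = a[p][j]; a[p][j] = t;
            }
            double t = b[k]; b[k] = b[p]; b[p] = t;
        }
        double inv = 1.0 / a[k][k];
        #pragma unroll
        for (int i = k + 1; i < N; ++i) {      // eliminate column k below row k
            double f = a[i][k] * inv;
            #pragma unroll
            for (int j = k + 1; j < N; ++j)
                a[i][j] -= f * a[k][j];
            b[i] -= f * b[k];
        }
    }

    // Back substitution; the solution overwrites v, as with getrs.
    #pragma unroll
    for (int i = N - 1; i >= 0; --i) {
        double s = b[i];
        #pragma unroll
        for (int j = i + 1; j < N; ++j)
            s -= a[i][j] * b[j];
        b[i] = s / a[i][i];
        v[(size_t)m * N + i] = b[i];
    }
}

// Example launch for the 9x9 case:
// solveTinySystems<9><<<(batch + 127) / 128, 128>>>(dA, dV, batch);
```

Per the advice above, switching the element type from double to float would ease register pressure if precision requirements permit.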

This approach used to work reasonably well for me quite a number of years ago, prior to the introduction of the batched interfaces in CUBLAS. The advantage is that you are not restricted by the API. You would definitely want to start with an experiment, using the smallest relevant matrix size, to see how this holds up performance-wise.

Thank you for your suggestion, njuffa! Yes, I wrote my own solver optimized for my matrix pattern and it was a few times faster. I did not write a generic getrf/getrs solver, though, since I’m not sure how efficient it would be.

I am glad to hear it worked out well. Yes, that is what I meant: build a specialized solver specifically for your use case, rather than using a generic interface as suggested by standard BLAS. Generic interfaces provide the widest applicability, but often this comes at the cost of performance.