cuBLAS or cuSolve strategy to invert many tiny matrices (>1 million 9x9 matrices)

frank0734 · June 19, 2020, 8:42pm

Problem: solve Ax=v for x with A being a 2d square matrix and v a 1d vector.
Typical size of a matrix: 9x9
Typical number of matrices: >=1e6

Current code using cuBLAS Batched getrf and getrs:

init:
- create handle
- cudaMalloc double A batchSizeNNsizeof(double)
- fill double ** Aptr and vptrs with addresses of each small 9x9 matrix stored in A and v
iterate:
- setAandV<<numBlocks, numThreads>>(A, v)
- cublasDgetrfBatched with Aptrs
- cublasDgetrsBatched with Aptrs and vptrs
- use results stored in v
destroy: free memory and destroy handle

It turns out setting the values in A and v is costly, possibly due to the low-speed of global memory where they are stored.

getrf also seems nontrivial, but I guess there is nothing I can do about it.

Any suggestions to improve the strategy?

Can we use malloc or new in the thread and store their address in Aptr to be used in cublasDgetrfBatched/`cublasDgetrsBatched’
Can we iterate over all cells and call cuSolver’s getrf an getrs for each cell? I feel this would be slow.

njuffa · June 20, 2020, 2:45am

It is not clear whether you want to actually invert matrices or need a solver for small dense systems. What is needed for most use cases is the latter, also for numerical reasons.

Assuming the matrices don’t get any larger than 9x9, one potential strategy is to write a self-contained dedicated solver (with partial pivoting if necessary) for each particular matrix size, completely unroll all the loops and convert the matrix elements into scalars. In that scenario, each thread handles one matrix, and it works best if the matrix elements are float, because high register usage might become a performance limiter otherwise.

This used to work reasonably well for me quite a number of years ago, prior to the introduction of batched interfaces in CUBLAS. The advantage is that you are not restricted by the API. You would definitely want to start with an experiment, using the smallest relevant matrix size to see how that holds up performance-wise.

frank0734 · June 22, 2020, 5:18pm

Thank you for your suggestion, njuffa! Yes, I wrote my own solver optimized for my matrix pattern and it was a few times faster. I did not write a generic getrf/getrs solver, though, since I’m not sure how efficient it would be.

njuffa · June 22, 2020, 5:23pm

I am glad to hear it worked out well. Yes, that is what I meant: Build a specialized solver specifically for you use case, rather than using a generic interface suggested by standard BLAS. Generic interfaces provide the widest applicability, but often this comes at the cost of performance.

Topic		Replies	Views
CUBLAS: Very low occupancy CUDA Programming and Performance	3	1030	December 11, 2015
cuBLAS sgetrf_batched on 100*100 matrices SLOW GPU-Accelerated Libraries	6	1571	February 29, 2016
CUBlas and very large matrices CUDA Programming and Performance	3	873	September 30, 2019
Having multiple relatively small problems GPU-Accelerated Libraries cublas , cusolver	5	782	April 7, 2022
cublas solving a linear system GPU-Accelerated Libraries	4	2551	June 28, 2017
Some Guidance on optimal approach to batch Matrix Multiply GPU-Accelerated Libraries cuda	0	431	August 11, 2020
cuBlas - MemCopy problem GPU-Accelerated Libraries	1	506	May 14, 2016
CuBLAS Showing Poor Performance CUDA Programming and Performance	6	1176	December 20, 2013
cuBLAS fails when matrix has more than 2^31-1 entries? CUDA Programming and Performance	13	726	October 12, 2021
simple matrix (or matrix vector) multiplication using CUBLAS CUDA Programming and Performance	9	5592	November 25, 2009

cuBLAS or cuSolve strategy to invert many tiny matrices (>1 million 9x9 matrices)

Related topics