cuSolver

Hi all,

I’ve been benchmarking cuSolver SVD performance for matrix sizes above and below 32.

On an H100 GPU the batched API scales nicely for sizes of 32 and below, reaching on the order of a million SVD operations per second for e.g. 16x16 and 32x32 matrices when the batch contains thousands of matrices.

However, above size 32 one can only use the unbatched SVD API of cuSolver, and for sizes 33-128 I do not observe any significant speed gain over BDCSVD from the Eigen C++ library, even when Eigen runs on a single CPU core.

Wouldn’t this point toward a problem with the arbitrary size limit of 32 for batched operations in cuSolver? At least for SVD, this limit would have to be significantly higher to make cuSolver useful (from a performance standpoint) for relatively small matrices in the range 33-128.

Now I do see reasons for having this size limit at 32. For one, the code may have been written with warp-synchronous programming in mind, operating on all columns or rows of a matrix simultaneously. Or maybe work matrices are kept in shared memory for optimal performance.

I think options for introducing parallelism for somewhat larger matrices should be explored, given that there is such a large performance “blind spot” where it currently does not make sense to do SVD with cuSolver. The horizontal axis is batch size, the vertical axis is SVDs per second.

cusolver questions should generally be posted over here. I can move it if you wish. I also think it might help spur the discussion if you provided your test harness. An immediate/obvious question from me would be what the work submission looks like for sizes >32. You indicate:

So what does this mean in that context:

Are you making use of streams at all to submit work (in the non-batched case)?

I run a for loop over the provided batch (at the application level) for matrices above size 32 and call into the unbatched API of cuSolver, just like for example libTorch and TensorFlow do.

This explains the slight dip for small batches of <8: work buffers have to be allocated just once per batch, and that fixed overhead is averaged over fewer matrices when the batch is small.

I did try streams, but that was surprisingly difficult because the API calls are not asynchronous. So I had to combine multithreading with streams to make it work, and even then I only got about a 2-2.5x improvement. I suspect the number of registers or the amount of shared memory used by the kernels did not allow for more parallelism.

Hi Christian,

Would the batched approximate SVD (cusolverDnDgesvdaStridedBatched, see the cuSOLVER 13.0 documentation) be an option for you?

Regards,

Christoph