Hi all,
I’ve been benchmarking cuSolver SVD performance for matrix sizes above and below 32.
On an H100 GPU, the batched API appears to scale nicely, reaching on the order of a million SVD operations per second for sizes up to 32 (e.g. 16x16 and 32x32 matrices) with large batches of thousands of matrices.
However, in the size range above 32 one can only use the unbatched SVD API of cuSolver. There, for sizes in the range 33-128, I do not observe any significant speed gain even over a single CPU core running BDCSVD from the Eigen C++ library.
Wouldn’t this point toward a problem with the arbitrary size limit of 32 for batched operations in cuSolver? At least for SVD, this limit would have to be significantly higher to make cuSolver useful (from a performance standpoint) for relatively small matrices in the range 33-128.
Now, I do see reasons for having this size limit at 32. For one, the code may have been written with warp-synchronous programming in mind, operating on all columns or rows of a matrix simultaneously. Or maybe the work matrices are kept in shared memory for optimal performance.
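To make those two guesses a bit more concrete, here is a rough back-of-the-envelope sketch in C++. The helper names are made up for illustration and are not part of cuSolver. I am also assuming that a kernel would keep A, U and V resident at the same time (3 copies) and would stay within the 48 KiB of shared memory a block gets without opt-in. I do not know how cuSolver is actually implemented, so treat the numbers as an illustration of the argument only.

```cpp
#include <cstddef>

// Hypothetical helper (not a cuSolver function): bytes needed to keep
// `copies` dense n x n matrices of `elem_size`-byte elements resident.
constexpr std::size_t footprint_bytes(std::size_t n, std::size_t elem_size,
                                      std::size_t copies) {
    return n * n * elem_size * copies;
}

// True if one full row or column maps onto a single warp, one element per lane.
constexpr bool fits_one_warp(std::size_t n, std::size_t warp_size = 32) {
    return n <= warp_size;
}

// True if the working set fits in the given per-block shared memory budget.
constexpr bool fits_shared(std::size_t n, std::size_t elem_size,
                           std::size_t copies, std::size_t budget_bytes) {
    return footprint_bytes(n, elem_size, copies) <= budget_bytes;
}

// Assumption: 48 KiB per block, i.e. no opt-in to larger dynamic shared memory.
constexpr std::size_t kBudget = 48 * 1024;
// Assumption: A, U and V are all held in shared memory at once.
constexpr std::size_t kCopies = 3;

// 32 is the largest size where a row or column fits one warp.
static_assert(fits_one_warp(32) && !fits_one_warp(33), "warp boundary at 32");
// 32x32 doubles: 3 * 8 KiB = 24 KiB, fits in the assumed budget.
static_assert(fits_shared(32, sizeof(double), kCopies, kBudget), "32 fits");
// 64x64 doubles: 3 * 32 KiB = 96 KiB, does not fit in the assumed budget.
static_assert(!fits_shared(64, sizeof(double), kCopies, kBudget), "64 does not fit");
```

Under these assumptions both constraints happen to break right after 32, which would explain the limit. Neither looks fundamental for the 33-128 range though: a row can be spread over several warps, and an H100 block can opt in to considerably more shared memory than 48 KiB.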
I think options to introduce parallelism for somewhat larger matrices should be investigated, given that there is such a large performance “blind spot” where it currently just does not make sense to do SVD with cuSolver. In the plot, the horizontal axis is batch size and the vertical axis is SVDs per second.
