cuBLAS sgetrf_batched on 100*100 matrices SLOW

Hi, I am using cuBLAS sgetrf_batched and sgetrs_batched to solve hundreds of thousands of Ax=b systems, with each A of dimension 100*100 and b of dimension 100*1.

Each A is symmetric positive definite. cuBLAS sgetrf_batched gives me ~30 GFLOPS on a K40, which seems slow. Is this low flop rate what I should expect, and should I try something else?
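For context, this is roughly the call sequence being described: a minimal host-side sketch, assuming device arrays of device pointers (d_Aarray, d_Barray) have already been populated; error checking omitted. (This cannot be verified without a GPU, so treat it as a sketch rather than a drop-in implementation.)

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Sketch: factor and solve a batch of n x n single-precision systems.
   Assumes d_Aarray and d_Barray are device arrays of device pointers,
   one entry per matrix / right-hand side. Error handling omitted. */
void solve_batch(cublasHandle_t handle,
                 float **d_Aarray,   /* batch of n x n matrices A */
                 float **d_Barray,   /* batch of n x 1 right-hand sides b */
                 int n, int batch)
{
    int *d_pivots, *d_info, h_info;
    cudaMalloc(&d_pivots, sizeof(int) * n * batch);
    cudaMalloc(&d_info,   sizeof(int) * batch);

    /* LU factorization with partial pivoting, all matrices at once */
    cublasSgetrfBatched(handle, n, d_Aarray, n, d_pivots, d_info, batch);

    /* Solve A x = b using the LU factors; x overwrites b.
       Note: the info argument here is a host pointer. */
    cublasSgetrsBatched(handle, CUBLAS_OP_N, n, 1,
                        (const float *const *)d_Aarray, n, d_pivots,
                        d_Barray, n, &h_info, batch);

    cudaFree(d_pivots);
    cudaFree(d_info);
}
```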

PS:

  1. I tried sgetrf in MAGMA; it is a bit slower.
  2. Batched sposv in MAGMA seems unstable for my As – I have to make A “much more positive definite” (by adding a positive number to the diagonal of each A) to make it stable.

Thanks for your suggestions.

There is a CUDA batch solver sample code available on the registered developer site:

https://developer.nvidia.com/accelerated-computing-developer-program-home

https://developer.nvidia.com/rdp/assets/cuda-batched-solver-tgz

"CUDA Batch Solver (Updated June 2013)

This code provides an efficient solver and matrix inversion for small matrices, using partial pivoting."

I haven’t used it, so I can’t give you much advice about it. It may be worth a look.

Hi, thanks for your reply. I downloaded the batch solver code and its description says:

“For GPU architectures >= sm_20 dsolve_batch can handle systems up to dimensions 76”

I don’t think I can use it out of the box, because I need to handle dimensions up to several hundred.

Sorry, I wasn’t aware of that limitation.

The batched solver code downloadable from the CUDA registered developer website was a precursor to the batch support later added to cuBLAS. The many downloads of that original code motivated the productization of the batched interfaces. Unless programmers need the source code, e.g. to add custom functionality, it should no longer be used.

You seem to be using single precision data. Note that there is no single-precision solver in the downloadable package, i.e., no function ssolve_batch(). You could make your own ssolve_batch() function following the pattern of the existing double-precision solver and it should be able to handle matrices up to about dimension 108x108 on a K40.

I don’t recall what exact performance one should expect from a K40, but 30 GFLOPS does not seem totally out of line. By their nature, solvers contain both sequential and parallel phases; they are not fully parallel like a matrix multiply. The general issue when processing small matrices is that they do not expose enough parallelism to keep the GPU busy; for good performance a GPU needs on the order of 10,000 threads in flight.

One can introduce batch processing to work around this, but making it fast then requires many registers or a lot of shared memory. Around a matrix size of 10x10 one runs out of registers, and around 100x100 one runs out of shared memory (e.g. at 76x76 when the matrix elements are ‘double’).

There is a gap between that size and the minimum size that can be handled efficiently by non-batched operations (around 256x256, I think, but my memory is hazy). The gap can be filled by various “hybrid” methods as best one can, but the performance will not necessarily be great.

If this is an important use case (i.e. significant impact on overall application run time), you may want to file an RFE (request for enhancement) with NVIDIA to further optimize the batched APIs for this range of matrix sizes. RFEs can be filed via the bug reporting form, simply prefix the synopsis with “RFE:” to mark it as such.


As seen from the testing script in MAGMA, solving many 100*100 matrices using cublas_sgetrf_batched achieves ~30 GFLOPS, which is low. MAGMA seems not much better.

Any suggestion on what to use to solve this kind of problem, i.e., many Ax=b where A is 100*100?

Thanks

The fact that two independently developed implementations deliver roughly the same performance tells us that a solver for matrices of this size is a hard challenge, and that there is no easy way (or maybe even a hard way) to achieve significantly more performance for this particular task. This kind of operation on matrices of this size is simply not a natural fit for the GPU.

Depending on the context of these operations, I would suggest trying to run them on the host, that is, the CPU. The CPU’s low-latency, high-throughput caches work well for matrices of this size. You will need a high-performance library such as MKL, and probably recent hardware (Haswell or later), for such an attempt to be worthwhile. It may not end up faster than your current GPU solution, but it seems worth exploring.