Pro Tip: cuBLAS Strided Batched Matrix Multiply

Originally published at:

There’s a new computational workhorse in town. For decades, general matrix-matrix multiply—known as GEMM in Basic Linear Algebra Subroutines (BLAS) libraries—has been a standard benchmark for computational performance. GEMM is possibly the most optimized and widely used routine in scientific computing. Expert implementations are available for every architecture and quickly achieve the peak performance of…

Very good post, Cris. I have been dealing with batch problems since the beginning of CUDA, and it's good to see ongoing work on this topic.

Are there any people working on cuSOLVER batch functions?

Oh yes, there is lots of interest in batched LU (getrf/getrs), QR (geqrf), Cholesky (potrf/potrs), and SVD (gesvd).

MAGMA already has implementations of many of these batched algorithms on GPUs:

Thanks for the info. It's been a while since I last checked the MAGMA release notes, and at that time there were no batched functions. That's good news!

Hi @disqus_dqT1zVDaih:disqus, what cuSOLVER batch functions are you specifically interested in?

Hi @Mark_Harris:disqus, sorry for the late response. I'm using Cholesky factorization (potrf) and the triangular system solver (potrs) to solve w^H = s^H R^(-1) systems, where R is a Hermitian positive definite matrix and s a known vector.

This is typically done when looking for the filter vector w in adaptive signal processing.

Unluckily for me, the matrix size is generally between 64x64 and 256x256, which is not enough to run efficiently on the GPU in sequential mode. Using stream parallelism, I have managed to make the GPU competitive against a multicore CPU. I will check out MAGMA's functions to see if they do better than CUDA streamed functions.


@Mark_Harris:disqus: Is there a roadmap for supporting more batched linear algebra, especially with different layouts? The strided layout here is useful, but other layouts could help us more. We are interested in GETRF, TRSM, QR, etc.

Hi Siva, cuBLAS has batched getrf, getrs and geqrf; are these not sufficient for your needs? If you'd like, you can contact me via email directly and I will share your specific needs with our libraries team and product managers.

Hi Cris,
In my application I need to multiply each slice of a 3D matrix (i.e. A[N x N x P]) with a 2D matrix (B[N x N]) and sum the resulting 2D matrices together to get another matrix (C[N x N]). I can do this by using Dgemm and launching P kernels, each calculating the multiplication of one slice of A with B and then adding the result to the existing C matrix (it seems that cublasDgemm has an atomicAdd inside that lets me add different slices).
Now, instead of launching P kernels, if I feed DgemmStridedBatched (see below) with A as a 3D matrix and B, C as 2D matrices, and set strideB = 0, strideC = 0, and beta = 1, I get the error "on entry to DGEMM parameter number 15 had an illegal value".

cublasDgemmStridedBatched(cnpHandle, CUBLAS_OP_N, CUBLAS_OP_N,
                          WIDTH, WIDTH, WIDTH, d_alpha,
                          d_A, WIDTH, WIDTH*WIDTH,
                          d_B, WIDTH, 0,
                          d_beta, d_C, WIDTH, 0, P);
Does C have to be a 3D matrix, and after filling it should I call another kernel to sum the slices of C?


Hi Soroush,

It turns out CUDA 8.0 disallowed certain strideC values, but this is eliminated in CUDA 9.0 (RC).

That said, what you are proposing won't work -- no reduction across the output matrices is implied. Setting strideC = 0 means that all output matrices will just overwrite each other. Your method of using gemms works not because there are atomicAdds, but because the sequence of gemms is serialized (red line in Figure 1).

To accomplish what you describe, I would call a gemv(A, vector of 1s, A') to sum the slices of A, then a gemm(A', B, C) with the appropriate strides and flattenings.

Thanks Cris!
I have a quick question. If I use cublasDgemmStridedBatched in a kernel and want to launch many kernels (i.e. 100,000 times), does each kernel call create a grid when it reaches cublasDgemmStridedBatched? Is it even possible to launch that many kernels, assuming there is no limitation on memory? Or is it better to use CUDA streams?
Thanks in advance!

The "zero stride" optimization mentioned in the article (many small matrices multiplied by a single matrix): is this already implemented in CUDA 9.0? This optimization would be really useful in both interfaces (pointer-to-pointer and strided batched).