multiple matrix-matrix multiplications

I have a moderate number of small (3x4) matrices and one constant large matrix (4x400, for example).

My approach so far has been to group the small matrices and run Sgemm in streams. I have been batching them in groups of 4 or 8, but I am not seeing much overlap (via Hyper-Q).

I saw a Stack Overflow (SO) post about the same problem, but there ‘B’ was a vector rather than a matrix.

In the past I have used cuSparse to replicate (repmat) small dense matrices along the diagonal of a large sparse matrix, and then multiplied that against a constant concatenated dense vector.

Is there a better way to perform this group of operations, preferably using a library in the CUDA SDK?

GEMM + streams (i.e. concurrent kernels)

batched GEMM:

GEMM + dynamic parallelism

from here:


Thanks. I did look at batched cuBLAS, but was not sure how to handle ‘B’ when it is constant.

According to the SO answer you linked, I will need to make copies of that same matrix, so I will give that a go…

“Even though you want to multiply your array of matrices (M) by a single matrix (N), the batch gemm function will require you to pass also an array of matrices for N (i.e. N), which will all be the same in your case.”

Since you are passing an array of pointers to B instead of B itself, it should be possible to simply pass the array with all pointers pointing to the same matrix B in device memory.

The SO link has been updated with a fully worked example. A small modification to the worked example should allow you to pass a single B matrix (referred to as N in that example) and simply have an array of pointers to B that all point to the same matrix.
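For reference, here is a minimal sketch of the shared-pointer idea. It is not the worked example from the SO answer. The function name `run_shared_b_batched`, the test data, and the batch size are made up for illustration, and error handling is reduced to a single flag. It assumes column-major storage and that the pointer arrays passed to `cublasSgemmBatched` live in device memory. It should build with something like `nvcc example.cu -lcublas`.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Multiplies `batch` small A_i (3x4) by ONE shared B (4x400) with a single
// cublasSgemmBatched call, then checks the result against a CPU reference.
// All matrices are column-major. Returns true if everything matched.
bool run_shared_b_batched(int batch) {
    const int m = 3, k = 4, n = 400;  // A_i: m x k, B: k x n, C_i: m x n
    std::vector<float> hA((size_t)batch * m * k), hB((size_t)k * n);
    std::vector<float> hC((size_t)batch * m * n);
    // Arbitrary test data (illustrative only).
    for (size_t i = 0; i < hA.size(); ++i) hA[i] = 0.01f * (float)(i % 17);
    for (size_t i = 0; i < hB.size(); ++i) hB[i] = 0.02f * (float)(i % 13);

    bool ok = true;
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    ok &= cudaMalloc((void**)&dA, hA.size() * sizeof(float)) == cudaSuccess;
    ok &= cudaMalloc((void**)&dB, hB.size() * sizeof(float)) == cudaSuccess;
    ok &= cudaMalloc((void**)&dC, hC.size() * sizeof(float)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);
    }

    // Host-side pointer arrays: A and C advance per batch entry,
    // while every B entry points at the same device matrix (no copies of B).
    std::vector<const float*> hAp(batch), hBp(batch);
    std::vector<float*> hCp(batch);
    for (int i = 0; i < batch; ++i) {
        hAp[i] = dA + (size_t)i * m * k;
        hBp[i] = dB;
        hCp[i] = dC + (size_t)i * m * n;
    }

    // The pointer arrays themselves must be copied to device memory.
    const float **dAp = nullptr, **dBp = nullptr;
    float **dCp = nullptr;
    ok &= cudaMalloc((void**)&dAp, batch * sizeof(float*)) == cudaSuccess;
    ok &= cudaMalloc((void**)&dBp, batch * sizeof(float*)) == cudaSuccess;
    ok &= cudaMalloc((void**)&dCp, batch * sizeof(float*)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dAp, hAp.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
        cudaMemcpy(dBp, hBp.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
        cudaMemcpy(dCp, hCp.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
    }

    cublasHandle_t handle = nullptr;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) ok = false;
    if (ok) {
        const float alpha = 1.0f, beta = 0.0f;
        // C_i = A_i * B for i in [0, batch); lda = m, ldb = k, ldc = m.
        ok &= cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                                 &alpha, dAp, m, dBp, k, &beta, dCp, m,
                                 batch) == CUBLAS_STATUS_SUCCESS;
        ok &= cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // CPU reference check (column-major indexing).
    for (int i = 0; ok && i < batch; ++i)
        for (int c = 0; c < n; ++c)
            for (int r = 0; r < m; ++r) {
                float ref = 0.0f;
                for (int p = 0; p < k; ++p)
                    ref += hA[(size_t)i * m * k + r + p * m] * hB[p + (size_t)c * k];
                if (std::fabs(ref - hC[(size_t)i * m * n + r + (size_t)c * m]) > 1e-4f)
                    ok = false;
            }

    if (handle) cublasDestroy(handle);
    cudaFree(dAp); cudaFree(dBp); cudaFree(dCp);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return ok;
}

int main() {
    bool ok = run_shared_b_batched(8);
    std::printf("%s\n", ok ? "match" : "mismatch");
    return ok ? 0 : 1;
}
```

The only difference from a "one B per batch entry" setup is the line `hBp[i] = dB;`, so B is stored once in device memory regardless of the batch size.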


I did not even think to do that (with the pointers)… Thanks.