I have a moderate number of small (3x4) matrices, and one constant large matrix (4x400, for example).

My approach so far has been to group the small matrices and launch one Sgemm per stream. I have been batching them in groups of 4 or 8, but I am not seeing much kernel overlap (via Hyper-Q).
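For reference, a minimal sketch of the streamed-Sgemm pattern described above. All names here (N, dA, dB, dC, the stream array) are placeholders I introduced for illustration, not code from my actual project:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: N small 3x4 matrices dA[i] times one constant 4x400 matrix dB,
// round-robined over a pool of streams so the GEMMs can overlap.
void small_gemms(cublasHandle_t handle, float **dA, const float *dB,
                 float **dC, int N, cudaStream_t *streams, int nStreams)
{
    const float alpha = 1.0f, beta = 0.0f;
    for (int i = 0; i < N; ++i) {
        cublasSetStream(handle, streams[i % nStreams]);
        // C_i (3x400) = A_i (3x4) * B (4x400), column-major layout
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    3, 400, 4,
                    &alpha, dA[i], 3,
                    dB, 4,
                    &beta, dC[i], 3);
    }
}
```

(I am aware cuBLAS also exposes cublasSgemmBatched, which takes arrays of A/B/C pointers in a single call; part of my question is whether that, or something else in the toolkit, would beat the stream approach for matrices this small.)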

I saw a Stack Overflow post about the same problem, but there the ‘B’ operand was a vector rather than a matrix.

In the past I have used cuSPARSE to replicate small dense matrices along the diagonal of a large block-diagonal sparse matrix, and then multiplied that against a constant concatenated dense vector.

Any ideas for a more efficient way to perform this group of operations (preferably using a library in the CUDA toolkit)?