I have a moderate number of small (3x4) matrices, and one constant large matrix (4x400, for example).

My approach so far has been to group the small matrices and launch one Sgemm per stream. I have been batching them in groups of 4 or 8, but I am not seeing much kernel overlap (via Hyper-Q).
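For reference, a minimal sketch of the streamed-Sgemm pattern described above. All names here (N, dA, dB, dC, the stream array) are placeholders I introduced for illustration, not code from my actual project:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: N small 3x4 matrices dA[i] times one constant 4x400 matrix dB,
// round-robined over a pool of streams so the GEMMs can overlap.
void small_gemms(cublasHandle_t handle, float **dA, const float *dB,
                 float **dC, int N, cudaStream_t *streams, int nStreams)
{
    const float alpha = 1.0f, beta = 0.0f;
    for (int i = 0; i < N; ++i) {
        cublasSetStream(handle, streams[i % nStreams]);
        // C_i (3x400) = A_i (3x4) * B (4x400), column-major layout
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    3, 400, 4,
                    &alpha, dA[i], 3,
                    dB, 4,
                    &beta, dC[i], 3);
    }
}
```

(I am aware cuBLAS also exposes cublasSgemmBatched, which takes arrays of A/B/C pointers in a single call; part of my question is whether that, or something else in the toolkit, would beat the stream approach for matrices this small.)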

I saw a Stack Overflow post about the same problem, but there the ‘B’ operand was a vector rather than a matrix.

In the past I have used cuSPARSE to replicate small dense matrices along the diagonal of a large block-diagonal sparse matrix, and then multiplied that against a constant concatenated dense vector.

Any ideas for a more efficient way to perform this group of operations (preferably using a library in the CUDA toolkit)?