cublas efficiency for block matrices?

Guys,

I’ve got an algorithm in which I need to multiply a block-diagonal matrix with 128 diagonal blocks, each 12x12, times a vector of 128 blocks of length 12 (i.e. the full matrix is 1536x1536, with 18432 nonzero entries and the rest zeros). 1536 is a multiple of 32, which cublas likes, as I understand it…

I know that other BLAS implementations have special routines for banded and block matrices, but these aren’t in cublas yet.

So, does anyone have a feel for how cublasSgemm will perform on such a sparsely populated matrix versus a custom kernel? As an alternative, I could call it on the 12x12 blocks 128 times, but I suspect that would be even less efficient (even assuming the data had already been copied to the GPU in one piece).
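For what it’s worth, here’s the kind of custom kernel I’d sketch for this, assuming the 12x12 blocks are stored contiguously as dense tiles. This is untested, and all the names (block_diag_gemv, d_blocks, etc.) are mine, not anything from cublas:

```cuda
// Sketch: y = A*x where A is block diagonal with NB = 128 diagonal
// blocks of size BS = 12. Only the 18432 nonzero entries are stored,
// as an array of dense BS x BS tiles (row-major within each tile).

#define NB 128   // number of diagonal blocks
#define BS 12    // rows/cols per block

__global__ void block_diag_gemv(const float *d_blocks,  // NB*BS*BS nonzeros
                                const float *d_x,       // input,  length NB*BS
                                float       *d_y)       // output, length NB*BS
{
    __shared__ float xs[BS];            // slice of x this block touches

    const int b   = blockIdx.x;         // which diagonal block (0..NB-1)
    const int row = threadIdx.x;        // which row inside it  (0..BS-1)

    xs[row] = d_x[b * BS + row];        // stage the x segment in shared memory
    __syncthreads();

    const float *A = d_blocks + b * BS * BS;   // this block's 12x12 tile
    float sum = 0.0f;
    for (int j = 0; j < BS; ++j)
        sum += A[row * BS + j] * xs[j];

    d_y[b * BS + row] = sum;
}

// Launch: one thread block per diagonal block, one thread per output row.
// block_diag_gemv<<<NB, BS>>>(d_blocks, d_x, d_y);
```

With only 12 threads per block the occupancy is poor, so in practice you’d probably want each thread block to handle several diagonal blocks, but even this naive version touches only the 18432 stored values.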

I’m sure I’ll start on a custom kernel soon, and it seems like a worthwhile thing to do. But before I do, I was wondering whether cublas does something behind the scenes to precondition the data, so as to avoid all of those wasted multiplications by zero. It seems like this would be a common issue…?

Thanks, and happy holidays!