speed up "thin" matrix multiplications in cuBLAS

I’m relatively new to cuBLAS, but it seems to me that cublasSgemm has a lot of overhead, so that “thin” matrix multiplications are not as fast as they could be. For example:

(1) cublasSgemm 3872 x 1 * 1 x 3136 took 0.746000 ms
(2) cublasSgemm 3136 x 3872 * 3872 x 1 took 0.375000 ms

(3) cublasSgemm 3872 x 9 * 9 x 3136 took 1.252000 ms
(4) cublasSgemm 3136 x 3872 * 3872 x 9 took 1.094000 ms

(5) cublasSgemm 3872 x 256 * 256 x 3136 took 5.678000 ms
(6) cublasSgemm 3136 x 3872 * 3872 x 256 took 3.458000 ms
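
For reference, here is a minimal sketch of how timings like these can be taken, using CUDA events around the call after a warm-up launch. The helper below is illustrative and assumes the device buffers are already allocated and filled:

```
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Sketch: time case (1), C(3872 x 3136) = A(3872 x 1) * B(1 x 3136).
   d_A, d_B, d_C are assumed to be allocated and filled elsewhere. */
void time_case1(cublasHandle_t handle,
                const float *d_A, const float *d_B, float *d_C)
{
    const int m = 3872, n = 3136, k = 1;
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* Warm-up launch so one-time initialization is not measured. */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, d_A, m, d_B, k, &beta, d_C, m);

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, d_A, m, d_B, k, &beta, d_C, m);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cublasSgemm %d x %d * %d x %d took %f ms\n", m, k, k, n, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```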

How can I speed up cases (1), (2), (3), and (4)?
Thanks!

http://docs.nvidia.com/cuda/cublas/#batching-kernels
http://docs.nvidia.com/cuda/cublas/#cublas-lt-t-gt-gemmbatched
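
The batched interface in the second link targets many small multiplications of identical dimensions, issued in a single call. A minimal sketch of invoking cublasSgemmBatched (the sizes, batch count, and lack of error checking are purely illustrative):

```
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Illustrative sizes: 100 independent 64 x 9 * 9 x 64 products. */
    const int m = 64, n = 64, k = 9, batch = 100;

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* One contiguous slab per operand; matrices are column-major. */
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * m * k * batch);
    cudaMalloc(&dB, sizeof(float) * k * n * batch);
    cudaMalloc(&dC, sizeof(float) * m * n * batch);

    /* cublasSgemmBatched takes arrays of device pointers that must
       themselves live in device memory. */
    float *hA[batch], *hB[batch], *hC[batch];
    for (int i = 0; i < batch; i++) {
        hA[i] = dA + (size_t)i * m * k;
        hB[i] = dB + (size_t)i * k * n;
        hC[i] = dC + (size_t)i * m * n;
    }
    float **dAp, **dBp, **dCp;
    cudaMalloc(&dAp, batch * sizeof(float *));
    cudaMalloc(&dBp, batch * sizeof(float *));
    cudaMalloc(&dCp, batch * sizeof(float *));
    cudaMemcpy(dAp, hA, batch * sizeof(float *), cudaMemcpyHostToDevice);
    cudaMemcpy(dBp, hB, batch * sizeof(float *), cudaMemcpyHostToDevice);
    cudaMemcpy(dCp, hC, batch * sizeof(float *), cudaMemcpyHostToDevice);

    /* A single launch computes all `batch` GEMMs. */
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                       &alpha, (const float **)dAp, m,
                       (const float **)dBp, k,
                       &beta, dCp, m, batch);
    cudaDeviceSynchronize();

    cudaFree(dAp); cudaFree(dBp); cudaFree(dCp);
    cudaFree(dA);  cudaFree(dB);  cudaFree(dC);
    cublasDestroy(handle);
    return 0;
}
```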

It’s not necessarily a question of overhead, but of different performance characteristics. As long as you multiply square-ish matrices, matrix multiplication is limited by computational throughput: a GEMM performs 2·m·n·k flops while touching only on the order of m·k + k·n + m·n matrix elements. I would estimate the required bandwidth at 30% to 50% of the GPU’s memory bandwidth.

Your first case is the exact opposite: there is hardly any computation going on, and the operation is completely limited by memory bandwidth.
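
To make that concrete: in case (1) the result is an outer product, so each of the 3872 x 3136 output elements costs a single multiply for every 4 bytes written. A bare-bones kernel for this shape (a sketch; column-major storage as in cuBLAS, and the launch configuration below is an assumption) shows how little arithmetic is involved:

```
/* Case (1) as a plain kernel: C(m x n) = x(m x 1) * y(1 x n).
   One multiply per element written, so the time is dominated by
   storing C, not by arithmetic. Column-major, as in cuBLAS. */
__global__ void outer_product(const float *x, const float *y,
                              float *C, int m, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < m && col < n)
        C[(size_t)col * m + row] = x[row] * y[col];
}

// Illustrative launch for m = 3872, n = 3136:
//   dim3 block(32, 8);
//   dim3 grid((3872 + 31) / 32, (3136 + 7) / 8);
//   outer_product<<<grid, block>>>(d_x, d_y, d_C, 3872, 3136);
```

Case (2) is a matrix-vector product, so cublasSgemv may also be worth trying there; it faces the same bandwidth limit, since the 3136 x 3872 matrix still has to be read once.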

Thanks, njuffa!
By “memory bandwidth”, you are not talking about the bandwidth between host (main) memory and GPU memory, right?
All the data is already in GPU memory before this function is called.
So which bandwidth are you referring to?

Do you have any suggestions for how to speed up (1) and (2)?
Thanks!

I am talking about GPU memory bandwidth. In your first case, I expect the code to be bottlenecked on writing out the result matrix: 3872 x 3136 floats at 4 bytes each is 48,570,368 bytes. I don’t know what GPU you have, but if we assume the writes happen at an effective throughput of 150 GB/sec, then they alone would take 0.324 milliseconds.
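
If you want to sanity-check that bound on your own GPU, timing pure writes over the same footprint gives a rough floor for case (1). A sketch using cudaMemset and CUDA events (memset is only a crude proxy for a kernel’s streaming writes):

```
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    /* Same footprint as the case (1) result: 3872 x 3136 floats. */
    const size_t bytes = 3872UL * 3136UL * sizeof(float);  /* 48,570,368 */

    float *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemset(d, 0, bytes);          /* warm-up */
    cudaEventRecord(start);
    cudaMemset(d, 0, bytes);          /* pure device-memory writes */
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("wrote %zu bytes in %f ms (%.1f GB/s)\n",
           bytes, ms, bytes / ms / 1.0e6);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```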