I’m relatively new to cuBLAS, but it seems to me that cublasSgemm has a lot of overhead, so “thin” matrix multiplications (where one of the dimensions is very small) are not fast enough. For example:

(1) cuBlasSgemm 3872 x 1 * 1 x 3136 took 0.746000 ms

(2) cuBlasSgemm 3136 x 3872 * 3872 x 1 took 0.375000 ms

(3) cuBlasSgemm 3872 x 9 * 9 x 3136 took 1.252000 ms

(4) cuBlasSgemm 3136 x 3872 * 3872 x 9 took 1.094000 ms

(5) cuBlasSgemm 3872 x 256 * 256 x 3136 took 5.678000 ms

(6) cuBlasSgemm 3136 x 3872 * 3872 x 256 took 3.458000 ms
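
To put these timings in perspective, here is a rough sketch that converts each one into achieved throughput. It assumes the usual count of 2*m*n*k floating-point operations for an (m x k) * (k x n) product; the helper name gflops is just for illustration, and the numbers are only as good as the timings above.

```python
# Timings copied from the measurements above: (m, k, n, milliseconds)
# for an (m x k) * (k x n) product.
cases = [
    (3872, 1, 3136, 0.746),    # (1)
    (3136, 3872, 1, 0.375),    # (2)
    (3872, 9, 3136, 1.252),    # (3)
    (3136, 3872, 9, 1.094),    # (4)
    (3872, 256, 3136, 5.678),  # (5)
    (3136, 3872, 256, 3.458),  # (6)
]


def gflops(m, k, n, ms):
    """Achieved GFLOP/s, assuming 2*m*n*k floating-point operations."""
    flops = 2.0 * m * n * k
    return flops / (ms * 1e-3) / 1e9


for i, (m, k, n, ms) in enumerate(cases, start=1):
    print(f"({i}) {m} x {k} * {k} x {n}: {gflops(m, k, n, ms):8.1f} GFLOP/s")
```

By this estimate (1) reaches only about 33 GFLOP/s while (5) reaches over 1000 GFLOP/s, which is why I suspect a fixed overhead or poor utilization in the thin cases.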

How can I speed up (1), (2), (3) and (4)?

Thanks!