I’m relatively new to cuBLAS, but it seems to me that cublasSgemm has a lot of overhead, so “thin” matrix multiplications are not as fast as I would expect. For example:
(1) cuBlasSgemm 3872 x 1 * 1 x 3136 took 0.746000 ms
(2) cuBlasSgemm 3136 x 3872 * 3872 x 1 took 0.375000 ms
(3) cuBlasSgemm 3872 x 9 * 9 x 3136 took 1.252000 ms
(4) cuBlasSgemm 3136 x 3872 * 3872 x 9 took 1.094000 ms
(5) cuBlasSgemm 3872 x 256 * 256 x 3136 took 5.678000 ms
(6) cuBlasSgemm 3136 x 3872 * 3872 x 256 took 3.458000 ms
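To put these timings in perspective, here is a rough sketch that converts each one into an effective throughput. It assumes the usual count of 2*m*n*k floating-point operations for an (m x k) * (k x n) product; the helper name `effective_gflops` is made up for illustration, and the sizes and timings are copied from the list above.

```python
def effective_gflops(m, k, n, ms):
    """Effective GFLOP/s for an (m x k) * (k x n) product that took `ms` milliseconds.

    Assumes 2*m*n*k floating-point operations (one multiply and one add
    per inner-product term), which is the conventional GEMM count.
    """
    flops = 2.0 * m * n * k
    seconds = ms * 1e-3
    return flops / seconds / 1e9


# (case number, m, k, n, measured time in ms), taken from the timings above
CASES = [
    (1, 3872, 1, 3136, 0.746),
    (2, 3136, 3872, 1, 0.375),
    (3, 3872, 9, 3136, 1.252),
    (4, 3136, 3872, 9, 1.094),
    (5, 3872, 256, 3136, 5.678),
    (6, 3136, 3872, 256, 3.458),
]

for case, m, k, n, ms in CASES:
    rate = effective_gflops(m, k, n, ms)
    print(f"({case}) {m} x {k} * {k} x {n}: {rate:8.1f} GFLOP/s")
```

Under that assumption, cases (1) to (4) come out far below cases (5) and (6), which is why the thin products look slow to me.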
How can I speed up (1), (2), (3) and (4)?
Thanks!