speed up "thin" matrix multiplications in cuBLAS

I’m relatively new to cuBLAS, but it seems that cublasSgemm has significant per-call overhead, so “thin” matrix multiplications are slower than expected. For example:

(1) cublasSgemm 3872 x 1 * 1 x 3136 took 0.746000 ms
(2) cublasSgemm 3136 x 3872 * 3872 x 1 took 0.375000 ms

(3) cublasSgemm 3872 x 9 * 9 x 3136 took 1.252000 ms
(4) cublasSgemm 3136 x 3872 * 3872 x 9 took 1.094000 ms

(5) cublasSgemm 3872 x 256 * 256 x 3136 took 5.678000 ms
(6) cublasSgemm 3136 x 3872 * 3872 x 256 took 3.458000 ms

How can I speed up (1), (2), (3) and (4)?
Thanks!

Try a rank-1 update operation (ger) for (1):

[url=http://docs.nvidia.com/cuda/cublas/index.html]cuBLAS :: CUDA Toolkit Documentation[/url]

Try a gemv operation for (2):

[url]http://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-gemv[/url]