CUBLAS configuration: using CUBLAS for small matrices

Hello,

I have tested some CUBLAS functions and compared them with a known CPU CBLAS implementation. My problem involves small matrices, so I want to know whether I can improve performance when using CUBLAS for a simple matrix product of size < 4000. Is there any configuration variable for setting the thread block size when using CUBLAS?

Thanks.
.H

CUBLAS will choose a thread block size that is optimized to provide best performance for your matrix size. 4000 x 4000 is plenty big to keep the GPU busy if you are doing matrix-matrix multiplication (SGEMM). Are you seeing poor performance? Note that power-of-2 matrices will likely perform better than non-power-of-2.

Mark

Hello,
Thank you for your answer. In fact, with sgemm, performance improves every time I use a bigger matrix (each one twice as big as the previous in my benchmark; all my tests use power-of-2 sizes). My question was about getting that kind of performance with smaller matrices.

Generally, sgemm is the ideal case (about 15 times CPU performance). When I use sgemv, sdot, or saxpy, performance is only about 2 times better than the CPU.

NB: I don't take the data initialization time (allocation + copy) into account.

.H

sgemv, sdot, saxpy, and other BLAS1/2 routines are bandwidth-limited, not compute-limited, so the speedup won’t be as large as for BLAS3 routines like sgemm.

Very small data sizes are inherently not going to perform that well on the GPU – it’s a highly parallel processor so it needs high data parallelism to be efficient.

Your best bet is to make sure you are minimizing transfers to and from the GPU and doing lots of BLAS routines that operate on data already on the GPU without transfers between calls.
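A minimal sketch of that pattern, using the original (pre-v2) CUBLAS C API this thread is about: copy the operands to the GPU once, chain several BLAS calls on the device-resident data, and copy the result back once at the end. Error checking is omitted, the specific call sequence is illustrative, and this requires a CUDA-capable GPU to actually run.

```c
#include "cublas.h"

/* a, b are N x N column-major host inputs; the final result lands in c. */
void chained_blas(int N, const float *a, const float *b, float *c)
{
    float *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(N * N, sizeof(float), (void **)&dA);
    cublasAlloc(N * N, sizeof(float), (void **)&dB);
    cublasAlloc(N * N, sizeof(float), (void **)&dC);

    /* Transfer the inputs to the GPU once... */
    cublasSetMatrix(N, N, sizeof(float), a, N, dA, N);
    cublasSetMatrix(N, N, sizeof(float), b, N, dB, N);

    /* ...then run several BLAS calls with no host transfers in between. */
    cublasSgemm('n', 'n', N, N, N, 1.0f, dA, N, dB, N, 0.0f, dC, N); /* C = A*B   */
    cublasSgemm('n', 'n', N, N, N, 1.0f, dC, N, dB, N, 0.0f, dA, N); /* A = C*B   */
    cublasSscal(N * N, 0.5f, dA, 1);                                 /* A = 0.5*A */

    /* Transfer the final result back once. */
    cublasGetMatrix(N, N, sizeof(float), dA, N, c, N);

    cublasFree(dA);
    cublasFree(dB);
    cublasFree(dC);
    cublasShutdown();
}
```

The two matrix transfers are amortized over three BLAS calls instead of being paid on every call, which matters most at exactly the small sizes discussed above, where PCIe transfer time can rival or exceed the kernel time itself.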

Mark