CUBLAS grids and threads division


I am using CUDA code and specifically CUBLAS functions. My question is, how do I set the block, grid, and thread division when using a CUBLAS function (e.g. cublasSgemm(…))? Is it internally optimized, and if so, which SM is it best suited for? Is it generally optimized? Are there matrix sizes that work best for it (aside from sizes being a multiple of 32)?
Or simply put, is there a list somewhere of how to make those functions work best?
It is quite difficult to find an answer in the toolkit documentation.

Thanks for any answer!

You don’t get to pick block and grid dimensions when using CUBLAS - the library itself does that. It is internally optimized, and the optimizations include specialization depending on which kind of compute capability you are running on.

The CUBLAS library is indeed optimized for every architecture (especially the GEMM routines).

cublasSgemm in particular is well optimized on Kepler (SM35) and Maxwell (SM5x).
On the Maxwell architecture, SGEMM works best if M and N are multiples of 128.
On the Kepler architecture, SGEMM works best if M is a multiple of 256 and N is a multiple of 192.
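
To make the size guidance above concrete, here is a minimal host-side sketch that rounds a matrix dimension up to the next multiple of a given tile size. The helper name round_up_to_multiple is hypothetical and is not part of CUBLAS. Whether padding a matrix to these sizes actually pays off in a given application is an assumption to check with a profiler. Only the figures 128, 256 and 192 come from the reply above.

```cpp
#include <cassert>

// Hypothetical helper (NOT a CUBLAS function): round `n` up to the next
// multiple of `tile`. Assumes n >= 0 and tile > 0.
inline int round_up_to_multiple(int n, int tile) {
    assert(n >= 0 && tile > 0);
    return ((n + tile - 1) / tile) * tile;
}

// Preferred multiples quoted in this thread for SGEMM (not taken from
// official documentation, and subject to change between CUBLAS versions).
const int kMaxwellTileM = 128;
const int kMaxwellTileN = 128;
const int kKeplerTileM  = 256;
const int kKeplerTileN  = 192;
```

For example, with M = N = 1000, round_up_to_multiple(1000, kMaxwellTileM) gives 1024, while on Kepler the same call gives 1024 for M (tile 256) and 1152 for N (tile 192).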

Ah, I see! Thanks for the info!

But now I have to wonder, how do you guys know this? Is there documentation of this somewhere?

The CUBLAS API is documented here:

If you study that documentation, you’ll find no mention of block and grid dimensions - so I think it’s pretty evident there are no API parameters to specify those. Although CUBLAS uses the GPU, it is a C-callable API, and can be linked to using ordinary C compilers, like gcc. So you’re not really writing CUDA device kernels (where you would normally specify block and grid dimensions) when using CUBLAS.

Regarding the discussion about optimization, the statements I made are basically discoverable using one of the profilers, such as nvprof. You can take the same code (same executable) and run it on different GPUs, and with the profiler observe that CUBLAS is calling different device kernels under the hood - so it’s fairly evident that it is detecting the device it is running on, then choosing kernels accordingly.


I recently profiled cublasSgemm; my application makes multiple calls to it.
One thing I noticed was that under the hood it called a different kernel every time.
The kernels were named something like sgemm_128X128_NN or sgemm_32X32_vec.
Can someone please explain the logic that decides which kernel to call?
Also, is 32x32 the size of shared memory per thread block?
Also, what does vec specify?

Those are internal implementation details that aren’t made publicly available by NVIDIA. They can and do change over time. A portion of the multitude of GEMM kernels is also GPU architecture specific, or at least that used to be the case.

Thank you, njuffa.