CUBLAS grid and thread division

Hello,

I am using CUDA code, specifically CUBLAS functions. My question is: how do I set the grid and block division when using a CUBLAS function (e.g., cublasSgemm(…))? Is it internally optimized, and if so, which SM (compute capability) is it best suited for? Is it generally optimized? Are there matrix sizes that work best for it (aside from the sizes being multiples of 32)?
Or simply put, is there a list somewhere of how to make those functions work best?
It is quite difficult to find an answer in the toolkit documentation.

Thanks for any answer!

You don’t get to pick block and grid dimensions when using CUBLAS - the library itself does that. It is internally optimized, and the optimizations include specialization depending on which kind of compute capability you are running on.
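To make that concrete, here is a minimal sketch of a cublasSgemm call (the function name sgemm_example is just illustrative; device pointers are assumed to be already allocated and filled, and error checking is omitted). Note that nowhere do you pass a grid or block configuration:

[code]
#include <cublas_v2.h>

// Sketch only: computes C = alpha*A*B + beta*C for n x n column-major
// matrices. dA, dB, dC are assumed to be device pointers. There is no
// <<<grid, block>>> launch syntax anywhere; cuBLAS chooses its own
// kernels and launch dimensions internally.
void sgemm_example(const float *dA, const float *dB, float *dC, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f;
    const float beta  = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,          /* m, n, k */
                &alpha, dA, n,    /* A, lda  */
                dB, n,            /* B, ldb  */
                &beta, dC, n);    /* C, ldc  */

    cublasDestroy(handle);
}
[/code]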

The CUBLAS library is indeed optimized for every architecture (especially the GEMM routines).

For cublasSgemm in particular, it is well optimized on Kepler (SM35) and Maxwell (SM5x).
On the Maxwell architecture, SGEMM works best if M and N are multiples of 128.
On the Kepler architecture, SGEMM works best if M is a multiple of 256 and N is a multiple of 192.
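If your problem sizes are irregular, one common trick (a sketch, not something the library requires) is to pad the matrices up to those multiples and zero-fill the extra rows and columns:

[code]
// Hypothetical helper: rounds a dimension up to the next multiple of m,
// e.g. round_up(1000, 128) == 1024. Padding M and N this way (with the
// padded regions zero-filled) can let SGEMM use its fastest tile sizes.
static int round_up(int x, int m)
{
    return ((x + m - 1) / m) * m;
}
[/code]

Whether padding actually pays off depends on how much extra memory and copying it costs, so it is worth profiling both variants.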

Ah, I see! Thanks for the info!

But now I have to wonder: how do you guys know this? Is this documented somewhere?

The CUBLAS API is documented here:

[url]http://docs.nvidia.com/cuda/cublas/index.html#abstract[/url]

If you study that documentation, you’ll find no mention of block and grid dimensions, so I think it’s pretty evident there are no API parameters to specify those. Although CUBLAS uses the GPU, it is a C-callable API and can be linked against using ordinary C compilers, like gcc. So you’re not really writing CUDA device kernels (where you would normally specify block and grid dimensions) when using CUBLAS.

Regarding the discussion about optimization, the statements I made are basically discoverable using one of the profilers, such as nvprof. You can take the same code (same executable) and run it on different GPUs, and with the profiler observe that CUBLAS is calling different device kernels under the hood - so it’s fairly evident that it is detecting the device it is running on, then choosing kernels accordingly.
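For example (my_app is a placeholder for your executable), running the application under the profiler prints a summary of the device kernels that were actually launched, including the internal CUBLAS kernel names:

[code]
nvprof ./my_app
[/code]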

Hello,

I recently profiled CUBLAS SGEMM; I had multiple calls to Sgemm in my application.
One thing I noticed was that under the hood it called a different kernel every time.
The kernels were named something like sgemm_128X128_NN or sgemm_32X32_vec.
Can someone please explain the logic that decides which kernel to call?
Also, is 32x32 the size of shared memory per thread block?
And what does vec specify?

Those are internal implementation details that aren’t made publicly available by NVIDIA. They can and do change over time. A portion of the multitude of GEMM kernels is also GPU architecture specific, or at least that used to be the case.

Thank you, njuffa!