cublasSgemm - is there a way to choose algorithm

Is there a way to specify or choose the algorithm or kernel to be used by cublasSgemm? For example, suppose nsys shows something like ampere_sgemm_64x32_sliced1x4_nt. Is there a way to request that specific kernel?

No, for cublasSgemm you have no control whatsoever.

If you use something like cublasGemmEx, you get some control over the algorithm, but it is quite limited and nothing like getting to pick individual kernels. cublasLtMatmul likewise has some algorithm-selection control.
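To give a sense of what that limited control looks like, here is a minimal sketch (not a complete program) of trying a few explicit cublasGemmEx algorithm values for one FP32 problem. It assumes the CUDA 11+ signature that takes a cublasComputeType_t, an existing handle, and device buffers dA, dB, dC that are already allocated and filled; note that the documentation states the algo enum is ignored on newer architectures (sm_80 and later).

#include <cublas_v2.h>
#include <cstdio>

// Illustrative helper (not part of cuBLAS): try a handful of explicit GEMM
// algorithms for one FP32 problem. dA (m x k), dB (n x k, used transposed)
// and dC (m x n) are assumed to be valid column-major device buffers.
void try_gemm_algos(cublasHandle_t handle, int m, int n, int k,
                    const float* dA, const float* dB, float* dC)
{
    const float alpha = 1.0f, beta = 0.0f;
    const cublasGemmAlgo_t algos[] = { CUBLAS_GEMM_DEFAULT, CUBLAS_GEMM_ALGO0,
                                       CUBLAS_GEMM_ALGO1,   CUBLAS_GEMM_ALGO2 };
    for (cublasGemmAlgo_t algo : algos) {
        cublasStatus_t st = cublasGemmEx(
            handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k,   // the "nt" case
            &alpha,
            dA, CUDA_R_32F, m,
            dB, CUDA_R_32F, n,
            &beta,
            dC, CUDA_R_32F, m,
            CUBLAS_COMPUTE_32F,    // accumulate in FP32
            algo);                 // requested algorithm; a hint at best
        if (st != CUBLAS_STATUS_SUCCESS)
            printf("algo %d not usable for this problem\n", (int)algo);
    }
}

Even when a specific algo value is accepted, the mapping from these enum values to the kernel names you see in nsys is not documented.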

Thanks. I see that the API for cublasGemmEx includes a parameter cublasGemmAlgo_t algo, and cublasLtMatmul includes a parameter cublasLtMatmulAlgo_t *algo. How do cublasGemmEx and cublasLtMatmul differ from cublasSgemm? In other words, why do they exist?

In nsys, is there a way to find the matrix shapes that resulted in calling a specific GEMM kernel (for example, ampere_sgemm_64x32_sliced1x4_nt)?

The BLAS functions like Sgemm, Dgemm, etc. provide certain functionality. One of the reasons these other functions exist is to provide BLAS-like functionality that is not actually provided by a typical netlib-style BLAS library, for example multiplication of 16-bit floating-point matrices. If you read the descriptions provided in the documentation, you will also get an idea of how they differ. For example, GemmEx:

This function is an extension of cublas<t>gemm that allows the user to individually specify the data types for each of the A, B and C matrices, the precision of computation and the GEMM algorithm to be run

The cublasLt documentation likewise describes why it exists.
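As a concrete illustration of that description, here is a sketch of a call that plain cublasSgemm cannot express: FP16 input and output matrices with FP32 accumulation. It assumes the CUDA 11+ signature, an existing handle, and pre-allocated column-major device buffers; error handling is omitted.

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch only: __half matrices with FP32 compute. dA is m x k, dB is k x n,
// dC is m x n, all column-major. The scalars are float because the compute
// type is CUBLAS_COMPUTE_32F.
cublasStatus_t half_gemm(cublasHandle_t handle, int m, int n, int k,
                         const __half* dA, const __half* dB, __half* dC)
{
    const float alpha = 1.0f, beta = 0.0f;
    return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha,
                        dA, CUDA_R_16F, m,
                        dB, CUDA_R_16F, k,
                        &beta,
                        dC, CUDA_R_16F, m,
                        CUBLAS_COMPUTE_32F,     // accumulate in FP32
                        CUBLAS_GEMM_DEFAULT);   // let the library choose
}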

No, there isn’t, except inferentially (try lots of different shapes and see what the profiler reports).
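nsys itself cannot recover the shapes, but if you control the call sites, one way to make that inference easier is to wrap each GEMM in an NVTX range that records the shape; nsys then shows which kernels ran inside which range. A rough sketch (the helper name and transpose choice are only illustrative; link with -lnvToolsExt):

#include <nvToolsExt.h>
#include <cublas_v2.h>
#include <cstdio>

// Annotate a GEMM with its shape so the nsys timeline correlates the
// (m, n, k) values with whatever cuBLAS kernel(s) actually launch.
void annotated_sgemm(cublasHandle_t handle, int m, int n, int k,
                     const float* dA, const float* dB, float* dC)
{
    const float alpha = 1.0f, beta = 0.0f;
    char label[64];
    snprintf(label, sizeof(label), "sgemm m=%d n=%d k=%d", m, n, k);
    nvtxRangePushA(label);      // range visible in the nsys timeline
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k,
                &alpha, dA, m, dB, n, &beta, dC, m);
    nvtxRangePop();
}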

Thanks. Out of curiosity, is there any fundamental limitation that prevents nsys from knowing the matrix shapes that resulted in calling a kernel? Can a single matrix multiplication result in calling multiple kernels?

Yes, a single matrix multiply could involve multiple kernels. For detailed nsys questions, I suggest asking those on the nsys forums.

The decision about which kernel to use for a given operation likely involves a number of factors, including the shape (m, n, k), and perhaps the GPU architecture and other factors. These decisions are likely made in host code. A profiler can tell you what code actually got executed, but the reasons for that you would have to infer yourself. If you wanted to discover them yourself, you would have to follow the execution path with considerable study to learn something like:

“if m > 512, use kernel xyzabc”

These libraries don’t have source code included, so you would be seeing (host) code like

mov  eax, m        ; load the m dimension
cmp  eax, 512      ; compare against a threshold
jge  launch_point_2
...

The profiler doesn’t have the smarts to inspect that code and come up with:

“kernel xyzabc will be launched if m > 512”

I won’t be able to answer “why?” beyond that, if it’s not already evident. Please ask such questions on the nsys forum.


For examples of using cuBLASLt, please see LtSgemmCustomFind and LtSgemmSimpleAutoTuning in our Math Libs GitHub repository.
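For readers who want a feel for the API before opening those samples, below is a condensed sketch of the core pattern they demonstrate: ask cuBLASLt for heuristic algorithm candidates, then pass one explicitly to cublasLtMatmul. It assumes the CUDA 11+ API, FP32 column-major buffers with no transposes, the default (zero-byte) workspace preference, and omits error checking; the samples do this more completely, including timing each candidate.

#include <cublasLt.h>
#include <cstdio>

// Sketch: request candidate algorithms for one FP32 GEMM shape and run the
// top heuristic choice. dA (m x k), dB (k x n) and dC (m x n) are assumed
// to be valid column-major device buffers.
void lt_sgemm_with_algo(cublasLtHandle_t ltHandle, int m, int n, int k,
                        const float* dA, const float* dB, float* dC)
{
    const float alpha = 1.0f, beta = 0.0f;

    cublasLtMatmulDesc_t opDesc;
    cublasLtMatmulDescCreate(&opDesc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    cublasLtMatrixLayout_t aDesc, bDesc, cDesc;
    cublasLtMatrixLayoutCreate(&aDesc, CUDA_R_32F, m, k, m);
    cublasLtMatrixLayoutCreate(&bDesc, CUDA_R_32F, k, n, k);
    cublasLtMatrixLayoutCreate(&cDesc, CUDA_R_32F, m, n, m);

    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);

    // Ask for up to 4 candidates ranked by the library's heuristic.
    cublasLtMatmulHeuristicResult_t results[4];
    int found = 0;
    cublasLtMatmulAlgoGetHeuristic(ltHandle, opDesc, aDesc, bDesc, cDesc, cDesc,
                                   pref, 4, results, &found);
    if (found > 0) {
        // Pass an explicit algo instead of NULL; this is the extent of the
        // control cuBLASLt offers over algorithm choice.
        cublasLtMatmul(ltHandle, opDesc, &alpha, dA, aDesc, dB, bDesc,
                       &beta, dC, cDesc, dC, cDesc,
                       &results[0].algo, /*workspace=*/nullptr,
                       /*workspaceSize=*/0, /*stream=*/0);
    } else {
        printf("no algorithm returned for this configuration\n");
    }

    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(cDesc);
    cublasLtMatrixLayoutDestroy(bDesc);
    cublasLtMatrixLayoutDestroy(aDesc);
    cublasLtMatmulDescDestroy(opDesc);
}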