cublasSgemm - is there a way to choose algorithm

Is there a way to specify or choose the algorithm or kernel to be used by cublasSgemm? For example, suppose nsys shows something like ampere_sgemm_64x32_sliced1x4_nt. Is there a way to request that specific kernel?

No, for cublasSgemm you have no control whatsoever.

If you use something like cublasGemmEx, you get some control over the algorithm, but it is quite limited and nothing like getting to pick individual kernels. cublasLtMatmul likewise has some algorithm-selection control.
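To give a sense of what that limited control looks like, here is a minimal sketch (not a complete program) of trying a few explicit cublasGemmEx algorithm values for one FP32 problem. It assumes the CUDA 11+ signature that takes a cublasComputeType_t, an existing handle, and device buffers dA, dB, dC that are already allocated and filled; note that the documentation states the algo enum is ignored on newer architectures (sm_80 and later).

#include <cublas_v2.h>
#include <cstdio>

// Illustrative helper (not part of cuBLAS): try a handful of explicit GEMM
// algorithms for one FP32 problem. dA (m x k), dB (n x k, used transposed)
// and dC (m x n) are assumed to be valid column-major device buffers.
void try_gemm_algos(cublasHandle_t handle, int m, int n, int k,
                    const float* dA, const float* dB, float* dC)
{
    const float alpha = 1.0f, beta = 0.0f;
    const cublasGemmAlgo_t algos[] = { CUBLAS_GEMM_DEFAULT, CUBLAS_GEMM_ALGO0,
                                       CUBLAS_GEMM_ALGO1,   CUBLAS_GEMM_ALGO2 };
    for (cublasGemmAlgo_t algo : algos) {
        cublasStatus_t st = cublasGemmEx(
            handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k,   // the "nt" case
            &alpha,
            dA, CUDA_R_32F, m,
            dB, CUDA_R_32F, n,
            &beta,
            dC, CUDA_R_32F, m,
            CUBLAS_COMPUTE_32F,    // accumulate in FP32
            algo);                 // requested algorithm; a hint at best
        if (st != CUBLAS_STATUS_SUCCESS)
            printf("algo %d not usable for this problem\n", (int)algo);
    }
}

Even when a specific algo value is accepted, the mapping from these enum values to the kernel names you see in nsys is not documented.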

Thanks. I see that the API for cublasGemmEx includes a parameter cublasGemmAlgo_t algo, and cublasLtMatmul includes a parameter cublasLtMatmulAlgo_t *algo. How do cublasGemmEx and cublasLtMatmul differ from cublasSgemm? In other words, why do they exist?

In nsys, is there a way to find the matrix shapes that resulted in calling a specific GEMM kernel (for example, ampere_sgemm_64x32_sliced1x4_nt)?

The BLAS functions like Sgemm, Dgemm, etc. provide certain functionality. One of the reasons these other functions exist is to provide BLAS-like functionality that is not actually provided by a typical netlib-style BLAS library, for example multiplication of 16-bit floating-point matrices. If you read the descriptions provided in the documentation, you will also get an idea of how they differ. For example, GemmEx:

This function is an extension of cublas<t>gemm that allows the user to individually specify the data types for each of the A, B and C matrices, the precision of computation and the GEMM algorithm to be run

The cublasLt documentation likewise describes why it exists.
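As a concrete illustration of that description, here is a sketch of a call that plain cublasSgemm cannot express: FP16 input and output matrices with FP32 accumulation. It assumes the CUDA 11+ signature, an existing handle, and pre-allocated column-major device buffers; error handling is omitted.

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch only: __half matrices with FP32 compute. dA is m x k, dB is k x n,
// dC is m x n, all column-major. The scalars are float because the compute
// type is CUBLAS_COMPUTE_32F.
cublasStatus_t half_gemm(cublasHandle_t handle, int m, int n, int k,
                         const __half* dA, const __half* dB, __half* dC)
{
    const float alpha = 1.0f, beta = 0.0f;
    return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha,
                        dA, CUDA_R_16F, m,
                        dB, CUDA_R_16F, k,
                        &beta,
                        dC, CUDA_R_16F, m,
                        CUBLAS_COMPUTE_32F,     // accumulate in FP32
                        CUBLAS_GEMM_DEFAULT);   // let the library choose
}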

No, there isn’t, except inferentially (try lots of different shapes and see what the profiler reports).
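nsys itself cannot recover the shapes, but if you control the call sites, one way to make that inference easier is to wrap each GEMM in an NVTX range that records the shape; nsys then shows which kernels ran inside which range. A rough sketch (the helper name and transpose choice are only illustrative; link with -lnvToolsExt):

#include <nvToolsExt.h>
#include <cublas_v2.h>
#include <cstdio>

// Annotate a GEMM with its shape so the nsys timeline correlates the
// (m, n, k) values with whatever cuBLAS kernel(s) actually launch.
void annotated_sgemm(cublasHandle_t handle, int m, int n, int k,
                     const float* dA, const float* dB, float* dC)
{
    const float alpha = 1.0f, beta = 0.0f;
    char label[64];
    snprintf(label, sizeof(label), "sgemm m=%d n=%d k=%d", m, n, k);
    nvtxRangePushA(label);      // range visible in the nsys timeline
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k,
                &alpha, dA, m, dB, n, &beta, dC, m);
    nvtxRangePop();
}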

Thanks. Out of curiosity, is there any fundamental limitation that prevents nsys from knowing the matrix shapes that resulted in calling a kernel? Can a single matrix multiplication result in calling multiple kernels?

Yes, a single matrix multiply could involve multiple kernels. For detailed nsys questions, I suggest asking those on the nsys forums.

The decision about which kernel to use for a given operation likely involves a number of factors, including the shape (m, n, k), and perhaps the GPU architecture and other factors. These decisions are likely made in host code. A profiler can tell you what code actually got executed, but the reasons for that you would have to infer yourself. If you wanted to discover them yourself, you would have to follow the execution path with considerable study to learn something like:

“if m > 512, use kernel xyzabc”

These libraries don’t have source code included, so you would be seeing (host) code like

mov  eax, m        ; load the m dimension
cmp  eax, 512      ; compare against a threshold
jge  launch_point_2
...

The profiler doesn’t have the smarts to inspect that code and come up with:

“kernel xyzabc will be launched if m > 512”

I won’t be able to answer “why?” beyond that, if it’s not already evident. Please ask such questions on the nsys forum.


For examples of using cuBLASLt, please see LtSgemmCustomFind and LtSgemmSimpleAutoTuning in our Math Libs GitHub repository.
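For readers who want a feel for the API before opening those samples, below is a condensed sketch of the core pattern they demonstrate: ask cuBLASLt for heuristic algorithm candidates, then pass one explicitly to cublasLtMatmul. It assumes the CUDA 11+ API, FP32 column-major buffers with no transposes, the default (zero-byte) workspace preference, and omits error checking; the samples do this more completely, including timing each candidate.

#include <cublasLt.h>
#include <cstdio>

// Sketch: request candidate algorithms for one FP32 GEMM shape and run the
// top heuristic choice. dA (m x k), dB (k x n) and dC (m x n) are assumed
// to be valid column-major device buffers.
void lt_sgemm_with_algo(cublasLtHandle_t ltHandle, int m, int n, int k,
                        const float* dA, const float* dB, float* dC)
{
    const float alpha = 1.0f, beta = 0.0f;

    cublasLtMatmulDesc_t opDesc;
    cublasLtMatmulDescCreate(&opDesc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    cublasLtMatrixLayout_t aDesc, bDesc, cDesc;
    cublasLtMatrixLayoutCreate(&aDesc, CUDA_R_32F, m, k, m);
    cublasLtMatrixLayoutCreate(&bDesc, CUDA_R_32F, k, n, k);
    cublasLtMatrixLayoutCreate(&cDesc, CUDA_R_32F, m, n, m);

    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);

    // Ask for up to 4 candidates ranked by the library's heuristic.
    cublasLtMatmulHeuristicResult_t results[4];
    int found = 0;
    cublasLtMatmulAlgoGetHeuristic(ltHandle, opDesc, aDesc, bDesc, cDesc, cDesc,
                                   pref, 4, results, &found);
    if (found > 0) {
        // Pass an explicit algo instead of NULL; this is the extent of the
        // control cuBLASLt offers over algorithm choice.
        cublasLtMatmul(ltHandle, opDesc, &alpha, dA, aDesc, dB, bDesc,
                       &beta, dC, cDesc, dC, cDesc,
                       &results[0].algo, /*workspace=*/nullptr,
                       /*workspaceSize=*/0, /*stream=*/0);
    } else {
        printf("no algorithm returned for this configuration\n");
    }

    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(cDesc);
    cublasLtMatrixLayoutDestroy(bDesc);
    cublasLtMatrixLayoutDestroy(aDesc);
    cublasLtMatmulDescDestroy(opDesc);
}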