Is CUBLAS_GEMM_DEFAULT_TENSOR_OP in cublasGemmEx no longer supported?

Hi, I am using the NVIDIA Jetson Orin Developer Kit 64GB (JetPack 5.0.2).

I was trying to use cublasGemmEx to run GEMM operations using only Tensor Cores.

My questions are as follows.

  1. Is it correct that I can execute a GEMM operation using only Tensor Cores with the function below? I will also leave a link to the source of the function.
cublasErrCheck(cublasGemmEx(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N,
                MATRIX_M, MATRIX_N, MATRIX_K,
                &alpha,  // pointer to the alpha scalar (C = alpha*A*B + beta*C)
                a_fp32, CUDA_R_32F, MATRIX_M,
                b_fp32, CUDA_R_32F, MATRIX_K,
                &beta,   // pointer to the beta scalar
                c_cublas, CUDA_R_32F, MATRIX_M,
                CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP));
  2. I checked the documentation and saw that CUBLAS_GEMM_DEFAULT_TENSOR_OP is no longer supported. Is that correct? If so, is there a similar option I can use instead, for example another BLAS library or other cuBLAS functions…
    (1. Introduction — cublas 12.2 documentation)

  3. While researching how to execute GEMM operations using only Tensor Cores, I heard that I can use GEMM from a library called cuTENSOR. Is it possible to use its Tensor Core contraction to execute GEMM operations using only Tensor Cores?

Does NVIDIA provide any libraries or functions for GEMM operations that run on Tensor Cores alone?

Just use CUBLAS_GEMM_DEFAULT. Heuristics will choose the fastest implementation, whether it uses Tensor Cores or not.

Same with cuTENSOR: the library will choose the best implementation.

Tensor Cores are generally used by default, if a suitable kernel exists.
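
For reference, here is a minimal sketch of what the call from the question might look like with CUBLAS_GEMM_DEFAULT. The helper name gemm_default is hypothetical, the CUBLAS_COMPUTE_32F compute type assumes CUDA 11 or newer, and the buffers are assumed to be column-major FP32 device allocations as in the original snippet. Which kernel actually runs is left to the cuBLAS heuristics.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post):
// computes C = alpha * A * B + beta * C for column-major FP32 matrices.
// d_a is m x k, d_b is k x n, d_c is m x n; all three are device pointers.
cublasStatus_t gemm_default(cublasHandle_t handle, int m, int n, int k,
                            const float *d_a, const float *d_b, float *d_c) {
    const float alpha = 1.0f;  // host-side scalars (default host pointer mode)
    const float beta = 0.0f;
    return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        m, n, k,
                        &alpha,
                        d_a, CUDA_R_32F, m,    // lda = m
                        d_b, CUDA_R_32F, k,    // ldb = k
                        &beta,
                        d_c, CUDA_R_32F, m,    // ldc = m
                        CUBLAS_COMPUTE_32F,    // assumption: CUDA 11+ compute-type enum
                        CUBLAS_GEMM_DEFAULT);  // let the heuristics pick the kernel
}
```

The caller would create a handle with cublasCreate, allocate the three buffers with cudaMalloc, and check that the returned status is CUBLAS_STATUS_SUCCESS.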

What I want is to run GEMM using only Tensor Cores, without using CUDA cores.

If I use the option you suggested (CUBLAS_GEMM_DEFAULT), I don’t think the GEMM operation is guaranteed to run on Tensor Cores only. Is that right?

If so, can you tell me which option runs GEMM on Tensor Cores only (without CUDA cores)?

GEMM kernels are either Tensor Core accelerated or SIMT (using CUDA cores). An example of a SIMT kernel would be FP64 GEMM on Pascal. TC kernels are usually faster than SIMT kernels on the same hardware. The older flag, CUBLAS_GEMM_DEFAULT_TENSOR_OP, dates from a time when the Tensor Core path wasn’t the default. It is today. So just use CUBLAS_GEMM_DEFAULT and let the heuristics choose the best kernel. If you don’t think they are picking the best one, you can manually benchmark each algorithm.
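
One way the manual benchmark could look is sketched below, timing cublasGemmEx with CUDA events for each cublasGemmAlgo_t value. The helper names time_gemm_algo and benchmark_gemm_algos are hypothetical, the FP32 data types and CUBLAS_COMPUTE_32F compute type are assumptions, and the loops assume the algorithm enum values are contiguous between the endpoints used. Algorithms that cuBLAS rejects are skipped. The cuBLAS documentation also notes that explicit algorithm IDs may have no effect on newer GPU architectures, so the timings can come out identical.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical helper: average time in ms of one FP32 GEMM with the given algorithm,
// or -1.0f if cuBLAS does not accept that algorithm for this problem.
float time_gemm_algo(cublasHandle_t handle, cublasGemmAlgo_t algo, int m, int n, int k,
                     const float *d_a, const float *d_b, float *d_c, int iters) {
    const float alpha = 1.0f, beta = 0.0f;
    auto run = [&]() {
        return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
                            d_a, CUDA_R_32F, m, d_b, CUDA_R_32F, k, &beta,
                            d_c, CUDA_R_32F, m, CUBLAS_COMPUTE_32F, algo);
    };
    if (run() != CUBLAS_STATUS_SUCCESS) return -1.0f;  // warm-up, filters unsupported algos

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) run();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed GEMMs have finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

// Times the default heuristic and each explicit algorithm ID on zero-filled matrices.
// Returns how many algorithms ran successfully.
int benchmark_gemm_algos(int m, int n, int k, int iters) {
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return 0;

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, sizeof(float) * m * k);
    cudaMalloc(&d_b, sizeof(float) * k * n);
    cudaMalloc(&d_c, sizeof(float) * m * n);
    cudaMemset(d_a, 0, sizeof(float) * m * k);
    cudaMemset(d_b, 0, sizeof(float) * k * n);

    // Assumption: enum values are contiguous between these endpoints.
    std::vector<int> algos = {CUBLAS_GEMM_DEFAULT};
    for (int a = CUBLAS_GEMM_ALGO0; a <= CUBLAS_GEMM_ALGO23; ++a) algos.push_back(a);
    for (int a = CUBLAS_GEMM_DEFAULT_TENSOR_OP; a <= CUBLAS_GEMM_ALGO15_TENSOR_OP; ++a)
        algos.push_back(a);

    int ran = 0;
    for (int a : algos) {
        float ms = time_gemm_algo(handle, static_cast<cublasGemmAlgo_t>(a),
                                  m, n, k, d_a, d_b, d_c, iters);
        if (ms < 0.0f) continue;  // not supported for this configuration
        std::printf("algo %4d: %.4f ms\n", a, ms);
        ++ran;
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cublasDestroy(handle);
    return ran;
}
```

Calling benchmark_gemm_algos(4096, 4096, 4096, 20), for example, would print one line per accepted algorithm. To see whether a Tensor Core kernel was actually selected, the kernel names reported by a profiler are a more direct check than the timings alone.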