How to enable Tensor core for cublasSgemmBatched on H100?

ingridli · November 2, 2023, 1:20pm

I tried calling cublasSetMathMode(blasHandle, CUBLAS_TF32_TENSOR_OP_MATH) to enable TF32 for cublasSgemmBatched.

Before I set TF32 mode, the Nsight Systems profile shows:

4.0% sm80_xmma_gemm_f32f32_f32f32_f32_nn_n_tilesize32x32x8_stage3_warpsize1x2x1_ffma_aligna4_alignc4_execute_kernel_51_cublas

It’s clear that no Tensor Core is used.

However, when I set the math mode, the Nsight Systems profile changes to this:

4.2% kernel 58.9% void cutlass:Kernel(T1::Params) 41.1% void cutlass:Kernel(T1::Params)

Comparing the kernels before and after, I conclude that this kernel now does the work of cublasSgemmBatched.

I found a 2020 post saying that when Tensor Cores are used for SGEMM, CUTLASS is what actually gets called [Does CUBLAS SGEMM work with tensor cores yet?]. I’m not sure whether this explains the above.

This leads to a further problem: I cannot see a more detailed description of the kernel, so without running some data tests I cannot directly determine whether Tensor Cores have actually been enabled.
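
For the data test mentioned above, one option is to compare the GEMM output against a reference whose inputs were deliberately rounded to TF32 precision. TF32 keeps FP32's 8 exponent bits but only 10 explicit mantissa bits, so rounding an input to TF32 introduces a relative error of up to about 2^-11. The Python sketch below emulates that rounding on the host. The helper names (`to_tf32`, `dot_fp32_inputs`, `dot_tf32_inputs`) and the round-to-nearest choice are assumptions for illustration, and the hardware's exact rounding may differ.

```python
import struct

def to_tf32(x):
    """Emulate TF32 input precision: keep the top 10 of float32's 23 mantissa bits.

    Assumes finite inputs in float32 range. Rounds to nearest with ties away
    from zero, which is an assumption and may not match the hardware exactly.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the dropped range, then clear the low 13 mantissa bits.
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def dot_fp32_inputs(a, b):
    # Reference: inputs used as given.
    return sum(x * y for x, y in zip(a, b))

def dot_tf32_inputs(a, b):
    # Inputs rounded to TF32; products and accumulation stay at Python's
    # double precision here, which is more than the FP32 accumulate on a GPU.
    return sum(to_tf32(x) * to_tf32(y) for x, y in zip(a, b))

# Arbitrary illustrative inputs, not taken from any real workload.
a = [1.0 + i * 1e-4 for i in range(8)]
b = [2.0 - i * 1e-4 for i in range(8)]
ref = dot_fp32_inputs(a, b)
approx = dot_tf32_inputs(a, b)
print("relative difference:", abs(approx - ref) / abs(ref))
```

If the GPU result sits noticeably closer to the TF32-rounded reference than to the plain FP32 one, that suggests TF32 math was used, though this is only indirect evidence.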

Robert_Crovella · November 2, 2023, 2:20pm

It’s not clear what your question is. You’ve already enabled tensor core and it appears to have changed the code behavior.

You can use the Nsight Compute profiler for this. There are numerous forum questions and even a blog article about how to use Nsight Compute to verify TC usage/activity.

Using metrics is one method. Another is simply to study the Compute Workload Analysis section in the default Nsight Compute report.
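
As a rough illustration of the metrics approach, one could collect a tensor-pipe instruction counter with something like `ncu --csv --metrics sm__inst_executed_pipe_tensor.sum ./app` and then check which kernels report a nonzero value. The Python sketch below parses such CSV text. The metric name, the `./app` placeholder, the column names ("Kernel Name", "Metric Name", "Metric Value") and the helper `tensor_instructions_by_kernel` are all assumptions here; metric names vary by GPU and Nsight Compute version, so confirm them with `ncu --query-metrics`. The sample input is made up to show the layout and is not real profiler output.

```python
import csv
import io

# Assumed metric name; confirm it exists on your GPU with `ncu --query-metrics`.
TENSOR_METRIC = "sm__inst_executed_pipe_tensor.sum"

def tensor_instructions_by_kernel(csv_text, metric=TENSOR_METRIC):
    """Sum the chosen metric per kernel name from ncu-style CSV text."""
    lines = csv_text.splitlines()
    # Skip any banner lines printed before the CSV header row.
    start = next(
        i for i, line in enumerate(lines)
        if "Kernel Name" in line and "Metric Value" in line
    )
    totals = {}
    for row in csv.DictReader(io.StringIO("\n".join(lines[start:]))):
        if row["Metric Name"] != metric:
            continue
        # Values may carry thousands separators such as "1,024".
        value = float(row["Metric Value"].replace(",", ""))
        name = row["Kernel Name"]
        totals[name] = totals.get(name, 0.0) + value
    return totals

# Made-up sample in the assumed layout, for illustration only.
SAMPLE = '''==PROF== Disconnected from process 0
"ID","Kernel Name","Metric Name","Metric Unit","Metric Value"
"0","kernel_a","sm__inst_executed_pipe_tensor.sum","inst","1,024"
"1","kernel_b","sm__inst_executed_pipe_tensor.sum","inst","0"
"2","kernel_a","sm__inst_executed_pipe_tensor.sum","inst","1,024"
'''

for name, count in tensor_instructions_by_kernel(SAMPLE).items():
    status = "tensor pipe used" if count > 0 else "no tensor pipe instructions"
    print(name, status, count)
```

A nonzero count for the kernel that replaced the original SGEMM kernel would indicate Tensor Core activity regardless of how the kernel is named.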

ingridli · November 2, 2023, 2:28pm

Thanks for your help. I will check the blog later.
When I run DGEMM on the H100, nsys shows a result like this:

100.0% sm90_xmma_gemm_f64f64_f64f64_f64_nn_n_tilesize32x32x32_stage3_warpsize2x2x1_tensor16x8x8_execute_kernel_51_cublas

So the kernel name clearly contains “tensor”.
But in the SGEMM case with TF32 enabled, it just shows:

4.2% kernel 58.9% void cutlass:Kernel(T1::Params) 41.1% void cutlass:Kernel(T1::Params)

So I’m not sure what the change means, as there is no “tensor” in the kernel name.

Maybe I shouldn’t judge whether Tensor Cores have been turned on by the kernel name alone?

Robert_Crovella · November 2, 2023, 2:35pm

I would judge whether TC is being used via the profiler, as I already mentioned.

There isn’t a decoder ring for judging things by kernel names. And even if it seems like there was one in the past, there was no specification for any such thing, so expecting to decode kernel names into the infinite future to determine TC usage is probably not sensible. Therefore I would use the profiler if it were important to me; that is a deterministic method.

Do as you wish, of course.

ingridli · November 3, 2023, 1:33am

Thanks a lot!

