How to enable Tensor core for cublasSgemmBatched on H100?

I have tried using cublasSetMathMode(blasHandle, CUBLAS_TF32_TENSOR_OP_MATH) to enable TF32 in cublasSgemmBatched.

Before I set TF32 mode, the Nsight Systems profile shows:

4.0% sm80_xmma_gemm_f32f32_f32f32_f32_nn_n_tilesize32x32x8_stage3_warpsize1x2x1_ffma_aligna4_alignc4_execute_kernel_51_cublas

It’s clear that no tensor cores are being used.
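For reference, here is a minimal sketch of the setup described above (assuming CUDA 11+ on an Ampere-or-newer GPU; matrix contents and error handling are omitted for brevity, and the sizes are placeholders I chose for illustration):

```cpp
// Sketch: requesting TF32 tensor-core math for cublasSgemmBatched.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 256, batch = 4;  // placeholder sizes
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Opt in to TF32 tensor-core execution for FP32 GEMMs.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    // Allocate batched device matrices (left uninitialized here).
    std::vector<float*> hA(batch), hB(batch), hC(batch);
    for (int i = 0; i < batch; ++i) {
        cudaMalloc(&hA[i], n * n * sizeof(float));
        cudaMalloc(&hB[i], n * n * sizeof(float));
        cudaMalloc(&hC[i], n * n * sizeof(float));
    }
    float **dA, **dB, **dC;
    cudaMalloc(&dA, batch * sizeof(float*));
    cudaMalloc(&dB, batch * sizeof(float*));
    cudaMalloc(&dC, batch * sizeof(float*));
    cudaMemcpy(dA, hA.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t st = cublasSgemmBatched(
        handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
        &alpha, (const float**)dA, n, (const float**)dB, n,
        &beta, dC, n, batch);
    cudaDeviceSynchronize();
    printf("cublasSgemmBatched status: %d\n", (int)st);

    cublasDestroy(handle);
    return 0;
}
```

Note that CUBLAS_TF32_TENSOR_OP_MATH is a hint, not a guarantee: the cuBLAS heuristics may still pick an FFMA kernel for some shapes.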

However, when I set the math mode, the Nsight Systems profile changes to this:

4.2% kernel: 58.9% void cutlass::Kernel&lt;T1&gt;(T1::Params), 41.1% void cutlass::Kernel&lt;T1&gt;(T1::Params)

By comparing the profiles before and after, I conclude that this cutlass kernel is now performing the work of cublasSgemmBatched.

I found a 2020 post saying that if you want to use tensor cores in SGEMM, CUTLASS kernels will actually be called [Does CUBLAS SGEMM work with tensor cores yet?]. I’m not sure whether that explains the above.

This leads to a further problem: since I cannot see a more detailed description of the kernel, I cannot directly determine whether the tensor cores were successfully enabled without running numerical tests.

It’s not clear what your question is. You’ve already enabled tensor cores, and it appears to have changed which kernels are launched.

You can use the Nsight Compute profiler for this. There are numerous forum questions and even a blog article about how to use Nsight Compute to verify tensor core usage/activity.

Using the metrics is one method. Another is simply to study the Compute Workload Analysis section in the default Nsight Compute report.
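As a command-line sketch of the metrics approach (the exact metric name is my assumption based on the usual tensor-pipe counters; metric names vary by architecture, so confirm with --query-metrics first):

```shell
# List the tensor-related metrics available on your GPU.
ncu --query-metrics | grep -i tensor

# Profile the app and collect a tensor-pipe instruction count; a nonzero
# value indicates tensor-core activity. (Metric name assumed; verify it
# appears in the --query-metrics output on your system.)
ncu --metrics sm__inst_executed_pipe_tensor_op_hmma.sum ./my_app
```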

Thanks for your help. I will check the blog later.
When I run dgemm on the H100, nsys shows a result like this:

100.0% sm90_xmma_gemm_f64f64_f64f64_f64_nn_n_tilesize32x32x32_stage3_warpsize2x2x1_tensor16x8x8_execute_kernel_51_cublas

So there is a clear “tensor” in the kernel name.
But in this case, it just shows

4.2% kernel: 58.9% void cutlass::Kernel&lt;T1&gt;(T1::Params), 41.1% void cutlass::Kernel&lt;T1&gt;(T1::Params)

So I’m not sure what the change means, since there is no “tensor” in the kernel name.

Maybe I shouldn’t judge whether the tensor cores are enabled just from the kernel name?

I would judge whether TC is being used via the profiler, as I already mentioned.

There isn’t a decoder ring for judging things by kernel names. And even if it seems like there was one in the past, there was no specification for any such thing, so expecting to decode kernel names into the infinite future to determine TC usage is probably not sensible. Therefore I would use the profiler if it were important to me; that is a deterministic method.

Do as you wish, of course.

Thanks a lot!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.