Reported number of hmma instructions by Nsight Compute

mustafaali · June 7, 2023, 6:46pm

Hi,

I am running a sample cuBLAS-based gemm kernel and profiling it on a v100 using ncu. The datatype is fp16, so I am using Tensor Cores (TCs). When I run an 8192x8192x8192 gemm kernel and profile it with Nsight Compute, I see that the number of hmma.884.f16.f16.xxx instructions is 1073741824.

I am trying to make sense of that number and its relation to the gemm shape and instruction granularity. Now, 8192*8192*8192/(8*8*4) = 2147483648, which is double the number of hmma.884 instructions reported by ncu. Could you please tell me the reason behind this difference, or whether I am missing something?
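The arithmetic in this question can be checked with a few lines of Python. This is only a sketch of the rough estimate used here: it assumes each hmma.884 instruction covers one full 8x8x4 tile of multiply-accumulates, which is the poster's assumption and not a documented property of the instruction. The function name `expected_mma_instructions` is made up for illustration.

```python
def expected_mma_instructions(m, n, k, tile):
    """Rough estimate: total multiply-accumulates divided by the MACs in one tile.

    Assumes (hypothetically) that one instruction computes one full tile.
    """
    tile_m, tile_n, tile_k = tile
    return (m * n * k) // (tile_m * tile_n * tile_k)


HMMA_884_TILE = (8, 8, 4)   # shape implied by the hmma.884 name
reported = 1_073_741_824    # hmma.884 count reported by ncu in this post

expected = expected_mma_instructions(8192, 8192, 8192, HMMA_884_TILE)
ratio = expected / reported

# Multiply-accumulates per reported instruction, derived only from the numbers above.
implied_macs = (8192 * 8192 * 8192) // reported

print(f"expected = {expected}, reported = {reported}, ratio = {ratio}")
print(f"implied multiply-accumulates per reported instruction = {implied_macs}")
```

Under that assumption the estimate is exactly twice the reported count, so each reported instruction accounts for 512 multiply-accumulates instead of the 256 in a single 8x8x4 tile.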

Thanks.

jmarusarz · June 8, 2023, 8:58pm

Thanks for reaching out. It’s not uncommon for the number of instructions executed to be significantly different than the expected value based on rough calculations for how many math ops need to be done. For example, an FMA instruction may do 2 math ops (multiply and add) in a single instruction. I don’t think all of the details of how hmma works under the hood are public, but in this case, it seems reasonable that only half as many assembly instructions are required to complete all the math operations.

mustafaali · June 8, 2023, 9:30pm

I checked many gemm shapes on v100 and a100 the same way I mentioned here. This mismatch only happens on v100; on a100 the count matches the expected value. Any thoughts on that? Or is there any way to investigate v100 SASS further?

jmarusarz · June 8, 2023, 9:40pm

Does the mismatch happen on all/many shapes of v100 or only some specific set? Can you share which ones match and which ones don’t?

mustafaali · June 8, 2023, 9:52pm

I checked m = n = k = 64, 128, 256, 512, 1024, 2048, 4096, 8192 for now, and in every case I see the same ratio between the reported number of instructions and the value expected from mnk and the hmma.884 shape. I can check more shapes as well, but I think there is a pattern here.

jmarusarz · June 8, 2023, 10:19pm

You’re saying for all those sizes, v100 has hmma ~= half the expected value but for a100 all those sizes match the expected value?

mustafaali · June 8, 2023, 10:31pm

For v100 I checked all these sizes, and yes, the number of hmma.884 instructions is half the expected number for all of them.

For a100 I checked fewer cases, and they use hmma.16816:
m=n=k=8192, number of instructions = 272M (almost equal to what is expected)
m=n=k=64, number of instructions = 256 (2x what is expected; this could be an outlier due to the small gemm shape)
m=n=k=1024, number of instructions = 524288 (equal to what is expected)
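The a100 numbers above can be compared against the same rough estimate. This sketch assumes one hmma.16816 instruction covers one full 16x8x16 tile (an assumption taken from the instruction name, not from documentation). The 8192 entry uses 272e6 as a stand-in for the approximate "272M" quoted above, and the helper name is hypothetical.

```python
def expected_mma_instructions(m, n, k, tile):
    """Rough estimate: total multiply-accumulates divided by the MACs in one tile."""
    tile_m, tile_n, tile_k = tile
    return (m * n * k) // (tile_m * tile_n * tile_k)


HMMA_16816_TILE = (16, 8, 16)  # shape implied by the hmma.16816 name

# Reported counts from the post; the 8192 value is approximate ("272M").
reported_a100 = {8192: 272e6, 64: 256, 1024: 524288}

ratios = {}
for size, reported in reported_a100.items():
    expected = expected_mma_instructions(size, size, size, HMMA_16816_TILE)
    ratios[size] = reported / expected
    print(f"m=n=k={size}: expected={expected}, reported={reported:.0f}, "
          f"reported/expected={ratios[size]:.3f}")
```

With these inputs the 1024 case matches exactly, the 64 case is 2x the estimate, and the 8192 case is within about 1.5% of it, consistent with the comments in the list above.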

jmarusarz · June 22, 2023, 8:42pm

Thanks for the details. We’re still looking into this internally. At first glance, the numbers don’t seem unexpected, but I haven’t found a good public resource explaining why. I’m continuing to ask around internally and I’ll let you know when I have more information.
