I am running a sample cuBLAS-based GEMM kernel and profiling it on a V100 using ncu. The datatype is fp16, so I am using tensor cores. When I run an 8192x8192x8192 GEMM and profile it with Nsight Compute, the number of hmma.884.f16.f16.xxx instructions reported is 1073741824.
I am trying to make sense of that number and how it relates to the GEMM shape and the instruction granularity. 8192*8192*8192/(8*8*4) = 2147483648, which is double the number of hmma.884 instructions reported by ncu. Could you please tell me the reason for this difference, or whether I am missing something?
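For anyone who wants to reproduce the arithmetic above, here is a minimal sketch. It assumes, as the question does, that one hmma.884 instruction covers an 8x8x4 tile of multiply-accumulates; that is the asker's working assumption, not a documented fact about the hardware. The function name `expected_hmma_count` is made up for illustration.

```python
def expected_hmma_count(m, n, k, tile=(8, 8, 4)):
    """Multiply-accumulates in an m x n x k GEMM divided by the
    multiply-accumulates assumed to be covered by one instruction.
    The tile shape is an assumption taken from the instruction name."""
    tm, tn, tk = tile
    return (m * n * k) // (tm * tn * tk)

expected = expected_hmma_count(8192, 8192, 8192)  # assumed 8x8x4 tile for hmma.884
measured = 1073741824                             # hmma.884 count reported by ncu above

print("expected:", expected)
print("measured:", measured)
print("expected / measured:", expected / measured)
```

If the ratio printed is 2.0, the measured count is exactly half of what the 8x8x4 assumption predicts.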
Thanks for reaching out. It’s not uncommon for the number of instructions executed to be significantly different than the expected value based on rough calculations for how many math ops need to be done. For example, an FMA instruction may do 2 math ops (multiply and add) in a single instruction. I don’t think all of the details of how hmma works under the hood are public, but in this case, it seems reasonable that only half as many assembly instructions are required to complete all the math operations.
I checked many GEMM shapes on V100 and A100 the same way I described here. The mismatch only happens on V100; on A100 the count matches the expected value. Any thoughts on that? Or is there any way to investigate the V100 SASS further?
Does the mismatch happen on all/many shapes of v100 or only some specific set? Can you share which ones match and which ones don’t?
So far I checked m = n = k = 64, 128, 256, 512, 1024, 2048, 4096, 8192, and in every case I see the same ratio between the number of hmma.884 instructions and m*n*k/(8*8*4). I can check more shapes as well, but I think there is a pattern here.
You’re saying for all those sizes, v100 has hmma ~= half the expected value but for a100 all those sizes match the expected value?
For V100 I checked all of these sizes, and yes, the number of hmma.884 instructions is half the expected number for all of them.
For A100 I checked only a few cases, and they use hmma.16816:
m=n=k=8192, number of instructions = 272M (almost equal to what is expected)
m=n=k=64, number of instructions = 256 (2x what is expected; this may be an outlier due to the small GEMM shape)
m=n=k=1024, number of instructions = 524288 (equal to what is expected)
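The A100 comparison can be checked the same way. This is a rough sketch that assumes one hmma.16816 instruction covers a 16x8x16 tile of multiply-accumulates (read off the instruction name, not confirmed by any documentation in this thread). The 8192 entry uses the rounded 272M figure quoted above, so its ratio is only approximate.

```python
# Assumed tile shape for hmma.16816, inferred from the instruction name.
TILE_M, TILE_N, TILE_K = 16, 8, 16

# Observed instruction counts from the A100 runs listed above.
# The 8192 value is the rounded "272M" figure, not an exact count.
observed = {
    8192: 272_000_000,
    64: 256,
    1024: 524288,
}

ratios = {}
for size, count in observed.items():
    expected = size**3 // (TILE_M * TILE_N * TILE_K)
    ratios[size] = count / expected
    print(f"m=n=k={size}: expected={expected}, observed={count}, "
          f"observed/expected={ratios[size]:.3f}")
```

A ratio near 1 means the observed count matches the tile-based estimate; the m=n=k=64 case comes out at 2.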
Thanks for the details. We’re still looking into this internally. At first glance, the numbers don’t seem unexpected, but I haven’t found a good public resource explaining why. I’m continuing to ask around internally and I’ll let you know when I have more information.