I’m trying to use ncu to benchmark some applications with respect to their Tensor Core usage (the devices I’m using are a 3080 and an A100).

For simple scenarios where I’m performing matrix multiplication with known values for M, N, and K, I can calculate the number of FLOPs from these values, and with the execution time I can calculate the performance.

But in other scenarios, such as benchmarking a more complex application that may contain several different matmuls and other TC operations, I’d like to use a more general query through ncu. In the same way one can use sm__inst_executed_pipe_fma to obtain the number of FMA instructions and then multiply by two to obtain the executed FLOPs, I’d like to know whether there is an analogous strategy for Tensor Cores.
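As a small sketch of the FMA-based arithmetic described above (this assumes the collected metric yields a thread-level FMA count, as the post implies; the sample values are illustrative, not measured):

```python
# FLOPs from an FMA count: each fused multiply-add is one multiply plus
# one add, i.e. 2 FLOPs. Counter value and runtime below are placeholders.
def fma_to_flops(fma_count: int) -> int:
    return 2 * fma_count

def flops_per_second(flops: int, elapsed_s: float) -> float:
    return flops / elapsed_s

total_flops = fma_to_flops(50_000_000_000)  # hypothetical counter value
print(flops_per_second(total_flops, 0.01))  # hypothetical 10 ms run
```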

I’ve seen that there is the sm__inst_executed_pipe_tensor query, but since different tensor instructions produce different FLOP counts (different shapes, e.g. m16n8k4 and m16n8k8 for TF32), it does not seem to be a viable way to properly calculate the FLOPs executed on the TCs without knowing which shape is being executed.

I’d love if someone could please give some advice on this matter. Thank you.

I guess to get a first-order approximation, I would look at the percent utilization of the relevant pipe (in “Compute Throughput Breakdown”) as reported in the GPU SOL section from ncu, and then multiply that by the device’s peak theoretical performance for that pipe.
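Spelled out, that first-order estimate is just utilization times peak (the numbers below are placeholders, not measurements):

```python
# First-order performance estimate: pipe utilization (percent of peak, from
# the SOL "Compute Throughput Breakdown") times that pipe's peak throughput.
def estimate_flops_per_s(pct_of_peak: float, peak_flops_per_s: float) -> float:
    return (pct_of_peak / 100.0) * peak_flops_per_s

# Placeholder inputs, just to show the arithmetic (result in TFLOP/s):
print(estimate_flops_per_s(49.91, 29.8e12) / 1e12)
```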

Thank you for your contribution!
I’ve tried your suggestion and compared it with my well-known values and it seems to differ quite significantly.

In my tests, when I execute the code under ncu, I obtain a performance of 25 TFLOP/s, which is roughly 85% of the peak of my 3080. However, sm__pipe_tensor_cycles_elapsed.avg.pct_of_peak_sustained_elapsed reports a utilization of 49.91%, which is much lower than the performance I’m actually achieving. Any idea why this might be?

Do you have the 3080 variant with 68 or with 70 SMs? What is your clock speed in MHz (base/boost)? Sparse or dense matrices? Do you count additions and multiplications separately (which doubles the FLOP count)? For the 3080 (unlike the A100), when using FP16 A and B input matrices the performance also depends on whether you accumulate in FP32 or FP16 precision.

TF32 m16n8k8!
I arrived at the 25 TF by computing the FLOPs from the dimensions of the matrices in the matmul (M, N, K) and dividing by the execution time.

In reality, it’s a bit more complicated, because every thread is running the same PTX instruction mma.m16n8k8 for a number of ITERATIONS, so in the end fma = (uint64_t)M * N * K * ITERATIONS * (THREADS_PER_BLOCK / 32) * NUM_BLOCKS.
And then flops = 2 * fma.
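That count can be written out as a small helper (the launch parameters below are hypothetical, purely to exercise the formula):

```python
# Total tensor-core FLOPs for a kernel where every warp issues the same
# mma.mMnNkK instruction ITERATIONS times, following the formula above.
def tensor_flops(m, n, k, iterations, threads_per_block, num_blocks):
    warps = (threads_per_block // 32) * num_blocks
    fma = m * n * k * iterations * warps  # one warp-wide mma = m*n*k FMAs
    return 2 * fma                        # each FMA = 2 FLOPs

# Hypothetical launch configuration:
print(tensor_flops(16, 8, 8, iterations=100_000,
                   threads_per_block=256, num_blocks=1024))
```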

I’ve laid out my FLOP calculations in my previous answer, and I’m confident they are correct. I do have the 68-SM 3080, and thank you for raising an interesting point that I completely missed.

When executing under ncu, it locks the clock at 1440 MHz, which is how I obtained the 25 TFLOP/s; that matches the expected maximum attainable value at that speed. In all other testing I do (without ncu), I use 1710 MHz and achieve 29.8 TF!
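A quick sanity check of the clock dependence (peak tensor throughput scales linearly with SM clock, so scaling the 1710 MHz peak to ncu’s locked 1440 MHz should land near the 25 TF observed under the profiler):

```python
# Scale the boost-clock TF32 peak from the post down to ncu's locked clock.
peak_at_1710 = 29.8e12             # FLOP/s at 1710 MHz (figure from the post)
scaled = peak_at_1710 * 1440 / 1710
print(round(scaled / 1e12, 1))     # -> 25.1 (TFLOP/s)
```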

So in the case I pointed out, I’m actually pretty much fully utilizing the GPU and still the pipe only indicated 49.9% usage.

Can you confirm that you used the dense variant? m16n8k8 exists as both dense and sparse.

For FP32 summation (which is used for TF32), the consumer compute capability 8.6 Tensor Cores have half the speed of the professional compute capability 8.6 Tensor Cores.

Based on that, here is just a guess at the reason for your results:

Perhaps the shown utilization does not account for this fact. Or the (input) pipe runs at full speed, but a later unit is slower, so the (input) pipe is only utilized 50%? Perhaps Robert knows?

Can you try FP16 with FP16 accumulation (which has no difference between consumer and professional GPUs) or try the A100 with TF32?

Can you paste in a picture from the Nsight Compute GUI showing the compute throughput breakdown in the SOL section?

Before taking the screenshot, hover your mouse over the title of the top item in the Pareto list, so we can see how that pipe’s throughput is computed, as well as the reported throughput.

I hope this is what you meant! This time I ran slightly fewer iterations just so that the profiling would be quicker, but it should still be extremely close to the top expected performance.

@Curefab yes, dense! I have the same code in FP16 with FP16 accumulation, and ran that, and you might be on to something, because now the tensor pipe indicated 97.86%! The performance achieved when running at 1710 MHz was 116.51 TF, which is 97.9% of the maximum throughput of 119 TF at this clock!
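For reference, the percentage quoted here checks out against the figures in the post:

```python
# Achieved vs. peak FP16 (FP16-accumulate) throughput at 1710 MHz,
# using the numbers reported above.
achieved = 116.51e12
peak = 119e12
print(round(100 * achieved / peak, 1))  # -> 97.9
```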

The following is Turing, which has no difference between consumer and professional Tensor Core performance, but it also reports pipe utilization that is too low by a factor of 2 or 4 (50% and 25%), which could be related: Why Low Tensor Pipe Utilization

It certainly seems like it’s based on Nsight Compute’s reporting. The link Curefab posted may be the answer, or you could post your question on the Nsight Compute forum.

There was an issue (at one point?) with the stored rooflines for Tensor Cores on different GPU architectures (basically the rooflines were only correct for GV100 Volta GPUs):

or

Those detailed counters could help you calculate exact FLOPs (but only if you know some details of your instructions; it is not enough to deduce the FLOPs of fully unknown code):
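If the instruction shape is known, the conversion those counters enable looks something like this (assuming the counter reports warp-level mma instructions; the actual metric names are in the links above):

```python
# FLOPs from a per-shape tensor instruction count: each warp-level
# mma.mMnNkK instruction performs m*n*k FMAs, i.e. 2*m*n*k FLOPs.
def mma_flops(m: int, n: int, k: int, warp_inst_count: int) -> int:
    return 2 * m * n * k * warp_inst_count

print(mma_flops(16, 8, 8, 1))  # one TF32 m16n8k8 instruction -> 2048 FLOPs
```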

Amazing content, thank you so much! It didn’t occur to me that it would be more appropriate to search in the ncu forums (I’m fairly new to these forums), but thank you and @Robert_Crovella for the help you have provided.

The bottom three links you provided seem to be the answer to my problem. I don’t have those metrics available, but I’m updating my version of ncu to see whether it makes them available (I have the 2022.4 version).