CUDA roofline analysis when the kernel is below the roof


I have this roofline analysis from ncu, but I don't fully understand it. Is my kernel compute-bound or memory-bound? It seems to be neither. The GPU is an RTX 2080.

Based on the location of your kernel on the roofline chart, it is not currently limited by either the memory subsystem or the compute subsystem. Not every kernel is bound by one of these hardware limits.
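To make "below the roof" concrete, here is a minimal sketch of the basic roofline calculation: the ceiling at a given arithmetic intensity is the smaller of the compute peak and intensity times memory bandwidth. The function names, the sample numbers, and the 60% "near the roof" threshold are all assumptions chosen for illustration; they are not taken from the ncu report in this thread.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline ceiling in FLOP/s at a given arithmetic intensity (FLOP/byte)."""
    return min(peak_flops, intensity * peak_bandwidth)


def classify(intensity, achieved_flops, peak_flops, peak_bandwidth, near=0.6):
    """Say which roof applies and how close the kernel gets to it.

    `near` is an arbitrary illustrative threshold, not an ncu setting.
    """
    roof = attainable_flops(intensity, peak_flops, peak_bandwidth)
    ridge = peak_flops / peak_bandwidth  # intensity where the two roofs meet
    side = "memory" if intensity < ridge else "compute"
    fraction = achieved_flops / roof
    if fraction >= near:
        return f"{side}-bound", fraction
    return f"below the {side} roof", fraction


# Illustrative numbers only (assumed, not measured):
PEAK_FLOPS = 10e12       # FLOP/s
PEAK_BANDWIDTH = 448e9   # bytes/s
label, fraction = classify(intensity=2.0, achieved_flops=100e9,
                           peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BANDWIDTH)
print(label, f"{fraction:.1%} of the roof")
```

A kernel that lands far under both roofs like this is usually held back by something the two-line model does not capture, such as latency, low occupancy, or poor instruction mix.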


I’m curious to know what’s going on. Where can I find resources on this topic? If the hardware is not the limit, why doesn’t the kernel run faster, or reach the memory or compute limits?

There’s a good overview of roofline here: Roofline Performance Model - NERSC Documentation. In general, if you’re nowhere near the roofs, take a look at the other sections of the report. They may have information on what else could be limiting your performance.

The roofline analysis mentions double precision. You have an RTX 2080, which is a Turing GPU with compute capability 7.5.

Looking at the arithmetic instruction throughput table in the Programming Guide here, 64-bit floating-point add, multiply, and multiply-add are listed for compute capability 7.x with a throughput of 32 ops/cycle per SM. If you click on the “5” footnote next to that entry, though, you find that the throughput is only 2 ops/cycle per SM for compute capability 7.5.

This could explain your poor performance.
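As a rough worked example of what that footnote means, the sketch below turns per-SM throughput into a peak FLOP/s figure. The SM count (46) and boost clock (1.71 GHz) are assumed reference values for an RTX 2080, and the FP32 figure of 64 results per cycle per SM is my reading of the same throughput table; check your own device with deviceQuery or ncu before relying on these numbers.

```python
def peak_flops(sm_count, results_per_cycle_per_sm, clock_hz, flops_per_result=2):
    """Theoretical peak FLOP/s; a fused multiply-add counts as 2 FLOPs."""
    return sm_count * results_per_cycle_per_sm * flops_per_result * clock_hz


SM_COUNT = 46      # assumed for RTX 2080
CLOCK_HZ = 1.71e9  # assumed reference boost clock

fp64_peak = peak_flops(SM_COUNT, 2, CLOCK_HZ)   # 2 FP64 results/cycle/SM on 7.5
fp32_peak = peak_flops(SM_COUNT, 64, CLOCK_HZ)  # 64 FP32 results/cycle/SM (assumed)

print(f"FP64 peak: {fp64_peak / 1e9:.1f} GFLOP/s")
print(f"FP32 peak: {fp32_peak / 1e12:.2f} TFLOP/s")
print(f"FP32 : FP64 ratio = {fp32_peak / fp64_peak:.0f} : 1")
```

Under these assumptions the double-precision compute roof sits about 32 times lower than the single-precision one, so a kernel doing its math in double can look very slow next to the card's headline FP32 rating.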