Understanding Memory Tables and Roofline Model


I'm trying to understand the output of Nsight Compute.

One thing I don't get is the Device Memory. I wrote a really simple kernel:

__global__ void roofline(float *test) {
    test[0] = 1;  // a single store to global memory
}

The Memory Table shows one L1/TEX store, which makes sense. But I don't understand why 5 sectors are loaded from device memory. It says one sector is 32 bytes, and an L1 or L2 cache line is 128 bytes = 4 sectors. How is it possible that 5 sectors were loaded? Shouldn't that number always be a multiple of 4?

The second thing I'm trying to understand is the rooflines in the roofline model. It says the peak work is 12,943,774,647,887.32 FLOP/s. But the website says that the GPU used (Quadro RTX 6000) has a peak performance of 16.3 TFLOPS. So how are those rooflines calculated, and why are those numbers used?
Also, I don't understand how the Peak Traffic is calculated.

And I think there is a bug? If the mouse hovers over a ridge point, it shows some values like peak performance,… but only if there is no achieved value present in the diagram. And it only works at one ridge point? (Using Nsight Compute for Windows)

Any help would be greatly appreciated!

Seems like the last update fixed the mentioned bug.

For the sectors question, what you’re actually seeing here are loads, not stores, so they are probably not related to that test[0] = 1 statement. Loads from device memory can come from things other than your kernel. In this case it may be instructions loaded into the cache or something else. The cache is a writeback cache, so the store you’re doing may never go back to device memory, hence the 0 stores in the profile. Without these caveats, your reasoning about which values you should be seeing would be correct. As an aside, the profiler isn’t always a good tool for measuring tiny kernels like this one, because of the issues described above.
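
To put numbers on the sector counts discussed above, here is a minimal Python sketch of the arithmetic. It only uses the two sizes quoted in the question (32-byte sectors, 128-byte cache lines); the function names are made up for illustration and are not part of any Nsight Compute API.

```python
SECTOR_BYTES = 32        # sector size quoted in the question
CACHE_LINE_BYTES = 128   # L1/L2 cache line size quoted in the question


def sectors_to_bytes(sectors):
    """Convert a sector count from the Memory Table into bytes."""
    return sectors * SECTOR_BYTES


def sectors_per_line():
    """Number of sectors that make up one full cache line."""
    return CACHE_LINE_BYTES // SECTOR_BYTES


def is_whole_lines(sectors):
    """True if the sector count corresponds to complete cache lines only."""
    return sectors % sectors_per_line() == 0


loaded = 5  # device memory load sectors reported for the tiny kernel
print(sectors_to_bytes(loaded), "bytes loaded from device memory")
print("whole cache lines only:", is_whole_lines(loaded))
```

A count that is not a multiple of 4 simply means the reported loads do not add up to complete cache lines, which fits the explanation that they need not come from the single store in the kernel.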

For the roofline, the peak in the tool is calculated based on a “base” frequency that the profiler locks the clocks to. This is a frequency that can be reliably sustained for long periods of time, as opposed to boosted frequencies, which are not guaranteed and may not be repeatable. This choice was made to ensure results can be compared fairly. The chip in production can still achieve the advertised numbers. The peak traffic percentage represents the total amount of data your kernel transferred compared to the maximum the hardware could have transferred over the duration of the kernel. So if you transferred 100 MB during your kernel, but the hardware could have transferred 1000 MB, you are at 10% of peak traffic.
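
As a rough sanity check of the explanation above, the sketch below works backwards from the two peak figures in the question to the clock each one implies, and reproduces the 10% traffic example. It assumes 4608 FP32 CUDA cores for the Quadro RTX 6000 and 2 FLOP per core per cycle (one fused multiply-add); both numbers come from public spec sheets rather than from this thread, and the function names are hypothetical. How Nsight Compute derives its peak internally is not shown here.

```python
# Assumptions (not stated in the thread): core count and FLOP per cycle.
CUDA_CORES = 4608              # FP32 cores, Quadro RTX 6000 spec sheet
FLOP_PER_CORE_PER_CYCLE = 2    # one fused multiply-add counts as 2 FLOP


def peak_flops(clock_hz):
    """Theoretical FP32 peak in FLOP/s at a given SM clock."""
    return CUDA_CORES * FLOP_PER_CORE_PER_CYCLE * clock_hz


def implied_clock_hz(peak):
    """Clock frequency that would produce the given FP32 peak."""
    return peak / (CUDA_CORES * FLOP_PER_CORE_PER_CYCLE)


def percent_of_peak_traffic(transferred_bytes, possible_bytes):
    """Share of the possible traffic that the kernel actually moved."""
    return 100.0 * transferred_bytes / possible_bytes


profiler_peak = 12_943_774_647_887.32   # FLOP/s shown in the roofline chart
advertised_peak = 16.3e12               # FLOP/s from the product page

print(f"profiler peak implies   {implied_clock_hz(profiler_peak) / 1e6:.0f} MHz")
print(f"advertised peak implies {implied_clock_hz(advertised_peak) / 1e6:.0f} MHz")
print(f"{percent_of_peak_traffic(100e6, 1000e6):.0f}% of peak traffic")
```

Under these assumptions the profiler's peak corresponds to a clock of roughly 1.40 GHz and the advertised peak to roughly 1.77 GHz, so the gap between the two figures is explained by the clock alone.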

Glad the bug was fixed. Thanks for letting us know.

