Variations in profiled metrics


I am trying to run a roofline analysis on a small test program. I found that "Device Memory Read Throughput" differs quite a lot between runs. In an extreme case, it is 0 B/s. I understand that the memory cold state may affect the value, along with a lot of other runtime effects, if I am right. Is there an option I can turn on to get an average, i.e. a mean value over several test runs?


If it’s a small test program, then your results probably won’t be reliable. My suggestion is to give your test program a larger input, one that launches a kernel large enough that warps are distributed across all the SMs.
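To judge how unreliable the numbers are for a given input size, one option is to record the metric over several runs by hand and look at the spread. The snippet below is only a minimal sketch of that bookkeeping: the function name `summarize_runs` and the sample readings are made up for illustration, and it does not call the profiler itself.

```python
import statistics


def summarize_runs(throughputs_bytes_per_s):
    """Summarize one metric that was collected over several profiler runs."""
    values = list(throughputs_bytes_per_s)
    if not values:
        raise ValueError("need at least one run")
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    return {
        "runs": len(values),
        "mean": mean,
        "stdev": spread,
        # Relative spread: standard deviation as a fraction of the mean.
        "rel_spread": spread / mean if mean > 0 else float("inf"),
        # Runs that reported 0 B/s, i.e. no device memory reads were counted.
        "zero_runs": sum(1 for v in values if v == 0),
    }


# Hypothetical readings in B/s, typed in by hand from five runs.
sample = [0.0, 2.0e9, 4.0e9, 2.0e9, 2.0e9]
summary = summarize_runs(sample)
print(summary)
```

A large relative spread, or any runs at 0 B/s, suggests the input is still too small for the metric to be meaningful, so the mean alone should not be trusted.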

What you’re observing could be the result of the performance counters not encountering any events from your kernel. According to this paper:

“We use the Compute Visual Profiler v4.0.7 for collecting the performance-counter data from which we compute our metrics. Note that, in Fermi-based GPUs, only one SM is equipped with performance-counter hardware. Hence, we can only measure information about the thread blocks that execute on that SM.”

There are two possible reasons for the variation you observed:

  1. A very small test case may never actually read from device memory. If the data set is small and was copied to the GPU just before the kernel ran, it may all reside in the L2 cache, resulting in no reads from device memory.

  2. The performance monitor system in Fermi and parts of Kepler is not able to observe all counters in one pass of the program, so the kernel has to be replayed. If the test program is very small, work distribution differences between replays can result in inaccurate results.

In order to fix case (2), it is recommended to increase the problem size so that you fill the GPU with work and access a data set larger than the L2 cache.
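As a rough way to turn that recommendation into a number, here is a small sketch that picks an element count satisfying both conditions. Everything in it is an assumption for illustration: the helper name `min_elements`, the one-element-per-thread mapping, the "several times L2" safety factor, and the device figures, which should be replaced with the values for your own GPU.

```python
import math


def min_elements(num_sms, blocks_per_sm, threads_per_block,
                 l2_bytes, bytes_per_element, l2_factor=4):
    """Smallest element count that meets both sizing conditions."""
    # Condition 1: enough threads to put several blocks on every SM
    # (assumes each thread processes exactly one element).
    fill_gpu = num_sms * blocks_per_sm * threads_per_block
    # Condition 2: working set larger than L2 by an arbitrary safety
    # factor, so the data cannot all be served from the cache.
    exceed_l2 = math.ceil(l2_factor * l2_bytes / bytes_per_element)
    return max(fill_gpu, exceed_l2)


# Hypothetical device: 14 SMs, 768 KiB of L2, 4-byte (float) elements.
n = min_elements(num_sms=14, blocks_per_sm=8, threads_per_block=256,
                 l2_bytes=768 * 1024, bytes_per_element=4)
print(n)
```

With these made-up figures the L2 condition dominates. On a device with many more SMs or a smaller cache, the fill-the-GPU condition may be the larger of the two.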