Question on 42 passes through a kernel neglecting cold cache

Hello, I have been using Nsight Compute to measure the performance of matrix multiplication kernels.

While explicit time measurement shows the first kernel call to be much slower than subsequent calls,
Nsight Compute reports a similar duration for every kernel call.

One thing that comes to mind is the log line “==PROF== Profiling “Kernel2” - 0: 0%…50%…100% - 42 passes”.
This seems to imply that Nsight Compute runs each kernel call 42 times.
Does this mean that, apart from the first iteration, the other 41 iterations operate on a warm cache?
That would imply that Nsight Compute doesn’t accurately capture the cache state a kernel would actually operate on.

The default cache setting for profiling is “--cache-control all”:

“All GPU caches are flushed before each kernel replay iteration during profiling. While metric values in the execution environment of the application might be slightly different without invalidating the caches, this mode offers the most reproducible metric results across the replay passes and also across multiple runs of the target application.”
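
For anyone who wants to compare the two modes, here is a minimal sketch of the invocations. The binary name ./matmul is a placeholder I made up, and Kernel2 is simply the kernel name from the question, so substitute your own:

```shell
# Placeholder names: ./matmul is a hypothetical application binary,
# Kernel2 is the kernel name taken from the log line in the question.
APP=./matmul
KERNEL=Kernel2

# Default behaviour, spelled out explicitly: all GPU caches are flushed
# before each kernel replay pass, so every pass starts from a cold cache.
ncu --cache-control all --kernel-name "$KERNEL" "$APP"

# Caches are left untouched between replay passes. Each pass then sees
# whatever earlier work left in the caches, so expect less reproducible numbers.
ncu --cache-control none --kernel-name "$KERNEL" "$APP"
```

Note that neither mode is guaranteed to reproduce the exact cache state the kernel sees in an unprofiled run; with “none” the state simply depends on what ran before each pass.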


Thank you for clarifying this!
