Hello, I have been using Nsight Compute to measure the performance of matrix multiplication kernels.
With explicit time measurement, the first kernel call is much slower than subsequent calls,
but when I measure with Nsight Compute, it reports a similar duration for every kernel call.
One thing that comes to mind is the profiler output line “==PROF== Profiling “Kernel2” - 0: 0%…50%…100% - 42 passes”.
This seems to imply that Nsight Compute runs each kernel call 42 times.
Does this mean that, apart from the first iteration, the other 41 iterations operate on a warm cache?
This would imply that Nsight Compute doesn’t accurately capture the cache state that a kernel would actually operate on.
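
For reference, here is a minimal sketch of the kind of explicit time measurement I mean, using CUDA events around each launch. The kernel body, the matrix size N, the launch configuration, and the number of launches are placeholders chosen for illustration, not my actual code, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: naive C = A * B
// for square N x N row-major matrices.
__global__ void Kernel2(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k) {
            acc += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}

int main() {
    const int N = 1024;  // assumed matrix size
    const size_t bytes = static_cast<size_t>(N) * N * sizeof(float);

    float *A = nullptr, *B = nullptr, *C = nullptr;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemset(A, 0, bytes);
    cudaMemset(B, 0, bytes);

    dim3 block(16, 16);  // assumed launch configuration
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time several back-to-back launches of the same kernel.
    // No cache flush is done between launches, so later launches
    // may see different cache contents than the first one.
    for (int i = 0; i < 5; ++i) {
        cudaEventRecord(start);
        Kernel2<<<grid, block>>>(A, B, C, N);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // wait until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("launch %d: %.3f ms\n", i, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return 0;
}
```

With a loop like this, the first launch is the one that shows up as much slower, whereas Nsight Compute reports similar durations for all of them.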