Is it acceptable to measure kernel performance using Nsight Compute?

Hello!

I’m in a situation where it’s difficult to measure with cudaEvent calls or other in-application timing, so I’m planning to measure the kernel duration with Nsight Compute and then divide the data moved by that duration to get a throughput metric. I’m wondering whether this would be a reliable method.

Thank you for reading my question

Yes, Nsight Compute has a reliable timing method; it is more accurate and reproducible than CUDA events. NCU’s default settings are designed to produce deterministic results by locking the GPU clock rate and putting the memory system into a more deterministic state. You may want to use the following options to control these behaviors:

  --cache-control arg (=all)            Control the behavior of the GPU caches during profiling. Allowed values:
                                          all
                                          none
  --clock-control arg (=base)           Control the behavior of the GPU clocks during profiling. Allowed values:
                                          base
                                          (Lock GPU clocks to base)
                                          none
                                          (Don't lock clocks)
                                          reset
                                          (Reset GPU clocks and exit)
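As a concrete illustration, a profiling run with these defaults spelled out explicitly might look like the following. The application and kernel names are placeholders, not from this thread:

```shell
# Sketch of an NCU run with the default deterministic settings made explicit.
# "./my_app" and "my_kernel" are hypothetical placeholders.
ncu --clock-control=base --cache-control=all \
    --kernel-name my_kernel \
    --metrics gpu__time_duration.sum \
    ./my_app
```

Switching to --clock-control=none --cache-control=none profiles closer to the application's normal running state, at the cost of run-to-run variation.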

Disabling cache control or not locking the clocks can cause multi-pass metrics to give less accurate results, which can even fall outside their valid range due to differences between passes. The default settings try to remove those differences between passes; however, this means the measured values will not match what the application sees in normal execution.

EXAMPLES OF DIFFERENCES

  1. L2 Priming - If the application does a host-to-device memcpy before launching a grid, then with --cache-control=all the L2 cache is invalidated before the kernel is profiled, losing any priming done by the memcpy or by a previous kernel.
  2. Clock Rate - --clock-control=base can change the compute-to-bandwidth ratio on 100-class GPUs: HBM-based GPUs cannot change the memory clock, but locking to the base clock does reduce the SM clock.
  3. Concurrent Execution - NCU serializes grids/ranges (depending on settings), giving all GPU resources to the target grid/range. In normal execution the application may run work concurrently on the compute or copy engines, which affects the measured duration.

Nsight Systems fixes neither the clocks nor the caches between kernels/ranges, so timing results from NCU and NSYS can differ even when NCU is run with --cache-control=none --clock-control=none.
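To build the throughput metric described in the question, you can divide the bytes the kernel moves by the NCU-reported duration. A minimal sketch, assuming a duration in nanoseconds taken from gpu__time_duration and a byte count known from the application (both values below are made-up sample numbers, not real profiler output):

```shell
# Hypothetical sample values, not real profiler output:
duration_ns=1048576     # kernel duration reported by NCU, in nanoseconds
bytes=268435456         # bytes read + written by the kernel, known from the app
# Bytes per nanosecond equals GB/s (1e9 bytes per second per 1e9 ns).
gbps=$(awk -v b="$bytes" -v t="$duration_ns" 'BEGIN { printf "%.2f", b / t }')
echo "$gbps GB/s"       # prints "256.00 GB/s" for these sample numbers
```

With clocks locked to base, the same calculation should be stable across runs, but as noted above it may not match the throughput seen in normal execution.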


Thank you for your answer!