Watch Resource Usage of an SM in Real Time

To fully understand the quality of a CUDA kernel, a direct approach would be to watch the SM's working state in real time, e.g., to see how well latency is hidden between arithmetic operations and memory accesses. Is there a tool that lets you actually watch and analyze this? I know Nsight Compute provides some information, but that is far from watching the hardware in real time. Maybe NVIDIA has internal tools for this?

NVIDIA DCGM (Data Center GPU Manager) provides access to basic metrics such as

  • SM Activity
  • SM Occupancy
  • Tensor Activity
  • FP32 Activity

The API allows these values to be queried at up to 10 Hz. Each value is an average across all instances of the unit on the GPU.
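
Because each value is an average across unit instances, a single reading cannot tell you how the work is distributed. The sketch below is only an illustration of that arithmetic, not DCGM code: the SM count and the per-SM busy fractions are made-up numbers, and `device_average` is a hypothetical helper.

```python
def device_average(per_sm_active):
    """Average a per-SM 'active' fraction over all SMs on the device."""
    return sum(per_sm_active) / len(per_sm_active)


NUM_SMS = 80  # hypothetical SM count, chosen only for illustration

# Case 1: a small grid keeps 10 SMs fully busy and leaves the other 70 idle.
few_sms_busy = [1.0] * 10 + [0.0] * (NUM_SMS - 10)

# Case 2: every SM is busy 12.5% of the time.
all_sms_lightly_busy = [0.125] * NUM_SMS

print(device_average(few_sms_busy))          # 0.125
print(device_average(all_sms_lightly_busy))  # 0.125
```

Both cases report the same device-level number, so a low SM Activity reading may mean either idle SMs or lightly loaded ones.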

DCGM
DCGM Profiling Metrics
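
As a rough sketch of how these metrics might be polled, the snippet below builds a `dcgmi dmon` command line for the four metrics listed above. The field IDs are the ones I believe the DCGM field-identifier documentation assigns to these profiling metrics; treat them, and the helper name `build_dmon_command`, as assumptions and check them against your installed DCGM version. It also assumes the DCGM host engine is running and that your GPU supports profiling metrics.

```python
import shutil
import subprocess

# Assumed DCGM profiling field IDs; verify against your DCGM version.
PROF_FIELDS = {
    "SM Activity": 1002,      # DCGM_FI_PROF_SM_ACTIVE
    "SM Occupancy": 1003,     # DCGM_FI_PROF_SM_OCCUPANCY
    "Tensor Activity": 1004,  # DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
    "FP32 Activity": 1007,    # DCGM_FI_PROF_PIPE_FP32_ACTIVE
}


def build_dmon_command(fields=PROF_FIELDS, rate_hz=10, samples=None):
    """Return the argv list for `dcgmi dmon` at the requested polling rate."""
    if not 0 < rate_hz <= 10:
        # The answer above states the values can be queried at up to 10 Hz.
        raise ValueError("rate_hz must be in (0, 10]")
    delay_ms = round(1000 / rate_hz)  # dmon takes its delay in milliseconds
    field_ids = ",".join(str(fid) for fid in fields.values())
    cmd = ["dcgmi", "dmon", "-e", field_ids, "-d", str(delay_ms)]
    if samples is not None:
        cmd += ["-c", str(samples)]  # stop after this many samples
    return cmd


if __name__ == "__main__":
    cmd = build_dmon_command(samples=50)
    if shutil.which("dcgmi") is None:
        print("dcgmi not found; would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=False)
```

Run this in a second terminal while the CUDA application executes; the output is a coarse, device-wide view, not a per-SM or per-kernel trace.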

Nsight Systems and Nsight Graphics GPU Trace can sample GPU metrics at much higher rates. However, they do not provide a real-time display: at 100 kHz sampling the data rate can reach multiple GiB/s, so real-time processing is not practical. Both tools let you start and stop collection multiple times while the application runs, and after each capture the report can be opened and analyzed.