How to Get the Exact Amount of Resources the GPU Uses at the Moment (e.g., Used Tensor Cores) Regardless of the Running Process

Nsight relies on attaching to a specific process or executable and only shows the resources allocated to that process, so it seems NVIDIA provides no hardware feature for monitoring which resources are in use across the whole GPU at a given moment. In other words, Nsight acts as a software layer that attaches to a process and records what it does and what it allocates.
Is that right? If not, how can this be achieved? Thanks.
EDIT: The key question is whether NVIDIA offers any hardware-accurate feature that reports exact resource usage (e.g., Tensor Cores in use).

There is Nsight Compute, which is more process-centric, and Nsight Systems, which collects more information about the whole system.

Thanks for your answer. However, even Nsight Systems provides no way to get the exact amount of resources the GPU is using at the moment (e.g., Tensor Cores in use) without specifying a process or executable, whether as a GPU hardware feature or a software-implemented one.

Not a perfect solution: you could write a background program that uses the debugger API and the profiler API (see "Debugger API :: CUDA Toolkit Documentation" and "NVIDIA CUDA Profiling Tools Interface (CUPTI) - CUDA Toolkit | NVIDIA Developer") and attaches to processes automatically.

Another way could be to write a program that runs in the background at the same time as other CUDA programs, uses few resources (4 warps per SM), requests the Tensor Cores at regular intervals, and measures how fast its own work completes. However, it would not work well if another process needs a whole SM (with the maximum number of threads).
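To make the probing idea a bit more concrete, here is a minimal sketch of the measurement side only, in Python. It assumes you already have some callable (called `probe` here, a hypothetical name) that launches a small Tensor Core workload and waits for it to finish; that part is not shown. The function names and the 0..1 score are my own illustration, not anything NVIDIA provides, and a slowdown is only an indirect hint of contention, not an exact usage figure.

```python
import statistics
import time
from typing import Callable, Iterator, List


def time_probe(probe: Callable[[], None], repeats: int = 5) -> float:
    """Return the median wall-clock time in seconds over `repeats` probe runs."""
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        probe()  # hypothetical: launch a small Tensor Core kernel and synchronize
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def contention_estimate(baseline_s: float, current_s: float) -> float:
    """Map a slowdown to a rough score: 0.0 means as fast as the idle
    baseline, values approaching 1.0 mean the probe was heavily slowed."""
    if baseline_s <= 0 or current_s <= 0:
        raise ValueError("timings must be positive")
    return max(0.0, 1.0 - baseline_s / current_s)


def monitor(probe: Callable[[], None], baseline_s: float,
            interval_s: float = 1.0, rounds: int = 10) -> Iterator[float]:
    """Yield one contention estimate per interval (assumed polling scheme)."""
    for _ in range(rounds):
        yield contention_estimate(baseline_s, time_probe(probe))
        time.sleep(interval_s)
```

You would call `time_probe` once on an otherwise idle GPU to get `baseline_s`, then iterate over `monitor(...)` while other programs run. The median is used so that a single outlier run does not dominate the estimate.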

Thanks for your answer. So NVIDIA has not implemented a software way to get accurate performance metrics (e.g., Tensor Cores in use) regardless of the running process. Since PTX/SASS does not have access to the GPU's Performance Monitoring Units (PMUs) (?), and these are managed by the driver itself, like context switching (?), it seems that recording performance metrics regardless of the running process would be possible, but it is not implemented at the driver level or in the API libraries (?).

I have no knowledge about such a possibility.

SASS has access to some performance monitoring features, though not in a documented way, and they serve more to create monitored events, e.g., to increment certain performance counters whenever those SASS instructions are executed.