Calculating utilization (core load) of Tensor cores and CUDA cores separately

Architecture: Turing
DL: using TensorRT
I know that the tensor_precision_fu_utilization and tensor_int_fu_utilization metrics report the tensor core utilization of each kernel on a scale of 0 to 10. Is there a convenient way to find the total utilization of the tensor cores over, say, 1 second, without going into each kernel's utilization?

I want to use the CUPTI APIs to separate the tensor core and CUDA core utilization of a complete deep learning network over a period of time. Right now we use the NVML library function nvmlDeviceGetUtilizationRates() to get the GPU utilization, but I think this API returns the total GPU core load and does not break it down into tensor cores and CUDA cores.
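
To make the per-kernel route concrete, here is a minimal sketch of the aggregation it would involve: a time-weighted average of the 0 to 10 levels over a window. The `KernelSample` record and the `windowed_tensor_utilization` helper are hypothetical names used only for illustration, not part of CUPTI or NVML. The sketch assumes the kernel start/end times and metric values have already been collected, that kernels do not overlap, and that idle time counts as zero utilization.

```python
from dataclasses import dataclass


@dataclass
class KernelSample:
    """Hypothetical record: one kernel's time span and its 0..10 metric level."""
    start_s: float
    end_s: float
    tensor_level: int  # e.g. a tensor_precision_fu_utilization value, 0..10


def windowed_tensor_utilization(samples, window_start_s, window_end_s):
    """Time-weighted tensor utilization (0.0..1.0) over one window.

    Assumes kernels do not overlap; time with no kernel running counts as 0.
    """
    window = window_end_s - window_start_s
    if window <= 0:
        raise ValueError("window must have positive length")
    weighted = 0.0
    for s in samples:
        # Only the part of the kernel that falls inside the window counts.
        overlap = min(s.end_s, window_end_s) - max(s.start_s, window_start_s)
        if overlap > 0:
            weighted += overlap * (s.tensor_level / 10.0)
    return weighted / window


if __name__ == "__main__":
    samples = [
        KernelSample(0.0, 0.2, 10),  # tensor-heavy kernel
        KernelSample(0.2, 0.5, 0),   # kernel that does not use tensor cores
        KernelSample(0.9, 1.3, 5),   # kernel that straddles the window end
    ]
    print(windowed_tensor_utilization(samples, 0.0, 1.0))
```

Because the metric is a coarse 0 to 10 level rather than a cycle count, the result is only an approximation, and collecting it this way still means profiling every kernel.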


Hi Vivek,

If I understand the use case, you want to capture the tensor core usage data without serializing the kernels in the application. Is that correct? I think a tool like DCGM (Data Center GPU Manager) is better suited for this use case, as it can continuously provide a set of device-level metrics with low performance overhead. I assume “Tensor Activity” is the metric you are interested in. More details can be found at