DL: using TensorRT
I know that the tensor_precision_fu_utilization and tensor_int_fu_utilization metrics report tensor core utilization for each kernel on a scale of 0 to 10. Is there a convenient way to find the total tensor core utilization over, say, 1 second, without looking at each kernel's utilization individually? I want to use the CUPTI APIs to separate tensor core utilization from CUDA core utilization for a complete deep learning network over a given time frame. Right now we use the NVML library function nvmlDeviceGetUtilizationRates() to get the GPU utilization, but I think this API returns the overall GPU load and does not break it down between tensor cores and CUDA cores.
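To make the quantity being asked about concrete, here is a minimal sketch of a time-weighted aggregation of per-kernel utilization levels over a window. It does not call CUPTI or NVML. It assumes the per-kernel start and end times and the 0 to 10 level have already been collected by some profiling step, and that kernels run one after another rather than concurrently. The names `KernelSample` and `windowed_tensor_utilization` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class KernelSample:
    """One profiled kernel (hypothetical record, not a CUPTI type)."""
    start_s: float      # kernel start time, in seconds
    end_s: float        # kernel end time, in seconds
    tensor_level: int   # per-kernel tensor utilization level, 0..10


def windowed_tensor_utilization(
    samples: Iterable[KernelSample],
    window_start_s: float,
    window_end_s: float,
) -> float:
    """Return the time-weighted mean utilization level (0..10) over the window.

    Time in the window with no kernel running counts as level 0.
    Assumes kernels do not overlap in time.
    """
    window = window_end_s - window_start_s
    if window <= 0:
        raise ValueError("window_end_s must be greater than window_start_s")

    weighted = 0.0
    for s in samples:
        # Only the part of the kernel that falls inside the window contributes.
        overlap = min(s.end_s, window_end_s) - max(s.start_s, window_start_s)
        if overlap > 0:
            weighted += overlap * s.tensor_level
    return weighted / window


if __name__ == "__main__":
    samples = [
        KernelSample(start_s=0.0, end_s=0.25, tensor_level=10),
        KernelSample(start_s=0.5, end_s=1.5, tensor_level=4),
    ]
    # Average level over the first second of the trace.
    print(windowed_tensor_utilization(samples, 0.0, 1.0))
```

Because the per-kernel level is a coarse 0 to 10 bucket, a figure computed this way would only be a rough estimate, and it still depends on gathering the per-kernel values first.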