Tensor metrics in Nsight Compute

In [1], I see a tensor-related metric for integer instructions:

tensor_int_fu_utilization: The utilization level of the multiprocessor function units that execute tensor core int8 instructions on a scale of 0 to 10. This metric is only available for devices with compute capability 7.2.

On a 2080 Ti, which is CC 7.5, nvprof doesn’t work, and Nsight Compute doesn’t seem to have a corresponding metric [2]. It only provides utilization for FP instructions (sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active).

So, isn’t there any way to collect data for that? Why doesn’t Nsight Compute support it?

Moreover, as far as I know, tensor operations generally use FP calculations. So, what does integer unit utilization really mean for tensor instructions?

[1] https://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference-7x
[2] https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvprof-metric-collection

So, isn’t there any way to collect data for that? Why doesn’t Nsight Compute support it?
Nsight Compute does support collecting tensor INT8 (IMMA) information for SM 7.5/Turing devices; this information is just missing from that table, and we will work to update it. SM 7.0 (Volta GV100) did not support INT8 tensor operations; they were added with the embedded Xavier chip (SM 7.2). See e.g. slide 20 of this presentation http://info.nvidia.com/rs/156-OFN-742/images/Jetson_AGX_Xavier_New_Era_Autonomous_Machines.pdf or https://developer.nvidia.com/embedded/faq. Turing also supports INT8 (see e.g. https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/).

The base metrics for Turing, as listed by

nv-nsight-cu-cli --query-metrics --chip tu102

are called

sm__pipe_tensor_cycles_active               # of cycles where tensor pipe was active                                                           
sm__pipe_tensor_op_hmma_cycles_active       # of cycles where HMMA tensor pipe was active                                                      
sm__pipe_tensor_op_imma_cycles_active       # of cycles where IMMA tensor pipe was active

As noted in the PDF I linked, HMMA refers to FP16 and IMMA to INT8 operations.
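A minimal Python sketch of collecting these counters for a whole application, assuming `nv-nsight-cu-cli` is on the `PATH`. The `.avg.pct_of_peak_sustained_active` suffix is borrowed from the FP metric in the question, and applying it to the HMMA/IMMA counters is an assumption; the script name `profile_tensor.py` and `./my_app` are hypothetical. The compute-capability helper only encodes what this thread says about SM 7.x.

```python
import shutil
import subprocess
import sys
from typing import List, Optional

# Base metrics from `nv-nsight-cu-cli --query-metrics --chip tu102`. The rollup
# suffix is copied from the FP metric in the question; using it on the
# HMMA/IMMA counters is an assumption, not something confirmed above.
SUFFIX = ".avg.pct_of_peak_sustained_active"
TENSOR_METRICS = [
    "sm__pipe_tensor_cycles_active" + SUFFIX,          # any tensor op
    "sm__pipe_tensor_op_hmma_cycles_active" + SUFFIX,  # FP16 (HMMA)
    "sm__pipe_tensor_op_imma_cycles_active" + SUFFIX,  # INT8 (IMMA)
]


def int8_tensor_metric(major: int, minor: int) -> Optional[str]:
    """Where INT8 tensor usage is reported, per this thread (SM 7.x only)."""
    if (major, minor) == (7, 2):  # Xavier: nvprof metric from [1]
        return "nvprof: tensor_int_fu_utilization"
    if (major, minor) == (7, 5):  # Turing: Nsight Compute base metric
        return "nv-nsight-cu-cli: sm__pipe_tensor_op_imma_cycles_active"
    return None  # SM 7.0 has no INT8 tensor ops; other chips not covered here


def build_command(app: List[str]) -> List[str]:
    """Return an nv-nsight-cu-cli invocation that collects the tensor metrics."""
    return ["nv-nsight-cu-cli", "--metrics", ",".join(TENSOR_METRICS), *app]


if __name__ == "__main__":
    # Usage (hypothetical names): python profile_tensor.py ./my_app [args...]
    if len(sys.argv) > 1 and shutil.which("nv-nsight-cu-cli"):
        subprocess.run(build_command(sys.argv[1:]), check=True)
    else:
        print("usage: profile_tensor.py <app> [args...] (needs nv-nsight-cu-cli)")
```

Passing all three metrics in one `--metrics` list lets you compare the IMMA share against the total tensor activity for the same kernels.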

Moreover, as far as I know, tensor operations generally use FP calculations. So, what does integer unit utilization really mean for tensor instructions?
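Tensor cores are not limited to FP: networks can also use INT8 inputs if the lower precision is not an issue. Since INT8 data is half the size of FP16 data, more of it can be moved and processed at the same time, which can lead to better performance. The IMMA metric above reports how busy the tensor pipe is with these integer operations.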
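To make the size argument concrete, here is a minimal NumPy sketch of one common way FP values are mapped to INT8 (symmetric per-tensor quantization). The scheme, the random data, and the variable names are assumptions for illustration; real inference frameworks use their own calibration schemes.

```python
import numpy as np

# Hypothetical activations in FP16; random values, for illustration only.
rng = np.random.default_rng(0)
x_fp16 = rng.standard_normal(1024).astype(np.float16)

# Symmetric per-tensor quantization (one common scheme, assumed here):
# map [-max|x|, +max|x|] onto the int8 range [-127, 127] with a single scale.
scale = float(np.abs(x_fp16).max()) / 127.0
x_int8 = np.clip(np.round(x_fp16.astype(np.float32) / scale), -127, 127).astype(np.int8)

# Dequantize to see how much precision was lost (rounding error is about scale/2 at most).
x_back = x_int8.astype(np.float32) * scale
max_err = float(np.abs(x_back - x_fp16.astype(np.float32)).max())

print(f"FP16 bytes: {x_fp16.nbytes}, INT8 bytes: {x_int8.nbytes}")  # 2048 vs 1024
print(f"scale={scale:.5f}, max abs error={max_err:.5f}")
```

The INT8 copy needs half the memory traffic of the FP16 one, at the cost of a bounded rounding error, which is the trade-off behind "if lower precision is not an issue".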

Thanks. I will follow that.