I’m attempting to quantize an LSTM built with the TRT Loop API to INT8 and profile the tensor ops. I’m collecting the performance counters with NCU 2022.2.1.0:
So far, none of the kernels show any counts for the selected counters. I was hoping to see some kernels that perform the matrix multiplications for the connections within the LSTM cells, and to see those operations reflected in the ncu counters. Is there a recommended method of profiling TRT INT8 graphs?
Those counters should work if tensor ops such as HMMA (FP16) or IMMA (INT8) are used. What GPU are you using? Could you share a screenshot of the SASS assembly for a kernel you’re expecting to use tensor instructions? We can check whether there are actual tensor ops in there.
The GPU I am using is a 2080 Ti. Is there a way to extract the SASS for a TRT engine built at runtime? I was expecting the INT8 ops to show up for the matrix-vector products below:
The SASS generated at runtime should still be stored in the Nsight Compute report. If you open the Source view for a kernel, it will show the SASS instructions, so you can look for these tensor operations there.
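If the SASS listing for a kernel is long, a small script can do the search for you. The sketch below is only an illustration: it assumes you have copied the SASS text out of the Source view into a string, and the `SAMPLE` listing is made up for the example rather than taken from a real TRT kernel. It counts lines whose opcode is one of the tensor-core MMA families (HMMA, IMMA, BMMA, DMMA); for an INT8 kernel you would expect IMMA to show up.

```python
import re
from collections import Counter

# SASS opcode families for tensor-core matrix-multiply-accumulate instructions.
TENSOR_OPCODES = ("HMMA", "IMMA", "BMMA", "DMMA")
_OPCODE_RE = re.compile(r"\b(" + "|".join(TENSOR_OPCODES) + r")\b")


def count_tensor_ops(sass_text):
    """Count SASS lines containing a tensor-core MMA opcode, keyed by family."""
    counts = Counter()
    for line in sass_text.splitlines():
        match = _OPCODE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical listing, for illustration only (not from a real report).
SAMPLE = """\
        /*0000*/  MOV R1, c[0x0][0x28] ;
        /*0010*/  IMMA.8816.S8.S8.SAT R4, R2.ROW, R3.COL, R4 ;
        /*0020*/  IMAD R0, R0, R5, R6 ;
        /*0030*/  IMMA.8816.S8.S8.SAT R8, R2.ROW, R7.COL, R8 ;
"""

if __name__ == "__main__":
    print(dict(count_tensor_ops(SAMPLE)))
```

An empty result for a kernel means none of these opcodes appear in its SASS, which would be consistent with the tensor counters reading zero for that kernel.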