NCU performance counters for profiling tensor ops for TRT int8 loop RNN

I’m trying to quantize an LSTM built with the TensorRT Loop API to int8 and profile its tensor ops. I’m collecting the following performance counters with NCU 2022.2.1.0:

sm__inst_executed_pipe_fma
sm__inst_executed_pipe_tensor
sm__inst_executed_pipe_tensor_op_imma
sm__pipe_tensor_cycles_active
sm__pipe_tensor_op_imma_cycles_active

The profiled kernels are named:

__myl_bb0_6_Gat
__myl_bb0_6_Gat
__myl_bb0_4_Gat
__myl_bb0_4_Gat
__myl_bb0_2_Add
__myl_bb0_2_Add
__myl_bb5_5_Mov
__myl_bb5_5_Mov
__myl_bb5_3_Mov
__myl_bb5_5_Mov
__myl_bb5_5_Mov
__myl_bb9_2_ResResCon
__myl_bb9_2_ResResCon

So far, none of the kernels show any counts for the selected counters. I was hoping to see kernels that perform the matrix multiplications for the connections within the LSTM cells, and to see those operations reflected in the NCU counters. Is there a recommended method for profiling TRT int8 graphs?

Those counters should work if tensor ops such as HMMA (or IMMA for int8) are used. What GPU are you using? Could you share a screenshot of the SASS assembly for a kernel that you expect to use tensor instructions? We can check whether it actually contains tensor ops.

I am using an RTX 2080 Ti. Is there a way to extract the SASS for a TRT engine built at runtime? I was expecting int8 ops to show up for the matrix-vector products below:

ILoop* sequenceLoop = network->addLoop();
sequenceLoop->addTripLimit(*sequenceSize, TripLimit::kCOUNT);

ITensor* input = sequenceLoop->addIterator(*inputTensors.data)->getOutput(0);
IRecurrenceLayer* hidden = sequenceLoop->addRecurrence(*inputTensors.hidden);
IRecurrenceLayer* cell = sequenceLoop->addRecurrence(*inputTensors.cell);

ITensor* mmInput = network
    ->addMatrixMultiply(*input, nvinfer1::MatrixOperation::kVECTOR,
        *params.inputWeights, nvinfer1::MatrixOperation::kTRANSPOSE)
    ->getOutput(0);

ITensor* mmHidden = network
    ->addMatrixMultiply(*hidden->getOutput(0), nvinfer1::MatrixOperation::kVECTOR,
        *params.recurrentWeights, nvinfer1::MatrixOperation::kTRANSPOSE)
    ->getOutput(0);
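
For reference, the arithmetic that an int8 version of one of these matrix-vector products would correspond to can be sketched in plain C++. This is only an illustration under stated assumptions, not TensorRT's implementation and not a use of the TensorRT API: it assumes symmetric per-tensor scales, int32 accumulation, and a row-major weight matrix W of shape [rows x cols], so that the vector-times-transposed-weights product yields y[r] = sum over c of x[c] * W[r][c]. The helper names `quantize` and `int8MatVec` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: symmetric per-tensor quantization of floats to int8.
// Values are rounded to the nearest step and clamped to [-127, 127].
std::vector<int8_t> quantize(const std::vector<float>& values, float scale)
{
    std::vector<int8_t> q(values.size());
    for (std::size_t i = 0; i < values.size(); ++i)
    {
        float const r = std::round(values[i] / scale);
        q[i] = static_cast<int8_t>(std::max(-127.0f, std::min(127.0f, r)));
    }
    return q;
}

// Hypothetical helper: computes y = x * W^T with int8 inputs.
// w is row-major [rows x cols], x has cols elements, y has rows elements.
// Products are accumulated in int32 and dequantized with wScale * xScale.
std::vector<float> int8MatVec(const std::vector<int8_t>& w, const std::vector<int8_t>& x,
    std::size_t rows, std::size_t cols, float wScale, float xScale)
{
    assert(w.size() == rows * cols && x.size() == cols);
    std::vector<float> y(rows);
    for (std::size_t r = 0; r < rows; ++r)
    {
        int32_t acc = 0;
        for (std::size_t c = 0; c < cols; ++c)
        {
            acc += static_cast<int32_t>(w[r * cols + c]) * static_cast<int32_t>(x[c]);
        }
        y[r] = static_cast<float>(acc) * wScale * xScale;
    }
    return y;
}
```

A call would look like `int8MatVec(quantize(weights, wScale), quantize(input, xScale), rows, cols, wScale, xScale)`, where `weights`, `input`, and both scales are placeholders for values that would come from calibration or explicit dynamic ranges. Integer multiply-accumulate of this kind is what the `imma` counters are meant to capture when it runs on the tensor pipe.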