I’m attempting to quantize an LSTM built with the TensorRT Loop API to INT8 and profile its tensor ops. I’m collecting the following performance counters with Nsight Compute (NCU) 2022.2.1.0:
and the kernels that are profiled are named
So far, none of the kernels show any counts for the selected counters. I was hoping to see kernels performing the matrix multiplications for the connections within the LSTM cells, and to see those operations reflected in the NCU counters. Is there a recommended method for profiling TensorRT INT8 graphs?
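For context, a minimal NCU command line for collecting tensor-pipe instruction counts might look like the sketch below. The metric names are assumed from NCU's Turing metric set, and `./my_trt_app` is a placeholder for whatever binary builds and runs the TRT engine:

```shell
# Sketch only: metric names assumed from NCU's Turing metric set;
# ./my_trt_app is a hypothetical binary that runs the TRT engine.
ncu --metrics sm__inst_executed_pipe_tensor_op_hmma.sum,sm__inst_executed_pipe_tensor_op_imma.sum \
    ./my_trt_app
```

If the IMMA/HMMA counts stay at zero for every kernel, the engine is likely not dispatching tensor-core implementations for those layers.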
Those counters should report activity if tensor ops such as HMMA are being used. Which GPU are you using? Could you share a screenshot of the SASS assembly for a kernel you expect to use Tensor instructions? We can check whether there are actual tensor ops in there.
The GPU I am using is an RTX 2080 Ti. Is there a way to extract the SASS for a TRT engine built at runtime? I was expecting the INT8 ops to show up for the matrix-vector products below:
ILoop* sequenceLoop = network->addLoop();
ITensor* input = sequenceLoop->addIterator(*inputTensors.data)->getOutput(0);
IRecurrenceLayer* hidden = sequenceLoop->addRecurrence(*inputTensors.hidden);
IRecurrenceLayer* cell = sequenceLoop->addRecurrence(*inputTensors.cell);
// The two lines below were truncated after "network"; the completion is an
// assumption based on the mmInput/mmHidden names. weightsInput and
// weightsHidden are hypothetical weight tensors for the input and recurrent
// connections.
ITensor* mmInput = network
    ->addMatrixMultiply(*input, MatrixOperation::kNONE,
                        *weightsInput, MatrixOperation::kTRANSPOSE)
    ->getOutput(0);
ITensor* mmHidden = network
    ->addMatrixMultiply(*hidden->getOutput(0), MatrixOperation::kNONE,
                        *weightsHidden, MatrixOperation::kTRANSPOSE)
    ->getOutput(0);