Hi,
Could you first run nvprof on your implementation to see which backend APIs are actually called?
$ sudo /usr/local/cuda-10.2/bin/nvprof [your app]
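If you need more detail than the summary table, you can also dump a per-kernel trace and save the output to a file, for example (just a sketch; trace.log is an arbitrary output name):
$ sudo /usr/local/cuda-10.2/bin/nvprof --print-gpu-trace --log-file trace.log [your app]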
Although TensorRT largely leverages cuDNN, some operations may use other libraries instead.
Here is my profiling result for sample_mnist; you can see it mainly uses cuBLAS (GEMM) and cuDNN:
==20592== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   11.73%  4.9390ms       296  16.685us     448ns  194.47us  [CUDA memcpy HtoD]
                   10.66%  4.4887ms       149  30.125us     416ns  330.29us  [CUDA memset]
                    3.77%  1.5858ms         8  198.23us  151.85us  247.60us  trt_volta_sgemm_128x128_relu_nn_v1
                    3.66%  1.5408ms        23  66.990us  14.368us  143.85us  void cudnn::cnn::conv2d_grouped_direct_kernel<float, float, float, float, float, float, bool=1, bool=0, int=0, int=0, int=0>(cudnnTensorStruct, float const *, cudnnFilterStruct, float const *, cudnnConvolutionStruct, cudnn::cnn::conv2d_grouped_direct_kernel<float, float, float, float, float, float, bool=1, bool=0, int=0, int=0, int=0>, float*, float, float*, cudnn::reduced_divisor, float, float, float, float, int, cudnnConvolutionStruct const *, float const *, cudnnActivationStruct)
                    3.11%  1.3091ms         8  163.63us  83.940us  245.68us  trt_volta_sgemm_64x64_relu_nn_v1
...
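If you want to reproduce my result first, you can profile the bundled sample the same way (assuming the default TensorRT sample location, which may differ on your install):
$ cd /usr/src/tensorrt/bin
$ sudo /usr/local/cuda-10.2/bin/nvprof ./sample_mnist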
Thanks.