Matching layer to nvprof output

I’ve created a simple nvinfer1::IProfiler class. When I run the network with the profiler set, layer 7 is reported as the most time-consuming layer:

Layer [6]: [(Unnamed Layer* 6) [Shuffle]]: 0.14832ms
Layer [7]: [(Unnamed Layer* 7) [Convolution] + (Unnamed Layer* 9) [Activation]]: 4.04125ms
Layer [8]: [(Unnamed Layer* 10) [Pooling]]: 0.79664ms
Layer [9]: [(Unnamed Layer* 10) [Pooling] output reformatter 0]: 0.241216ms

nvprof shows this:

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   25.32%  209.63ms        52  4.0314ms  4.0190ms  4.0893ms  trt_volta_scudnn_128x64_relu_medium_nn_v1
                   19.10%  158.13ms       364  434.42us  268.55us  621.42us  trt_volta_h884cudnn_256x128_ldg8_relu_exp_medium_nhwc_tn_v1
                   15.91%  131.78ms       364  362.03us  14.368us  1.1701ms  trt_volta_h884cudnn_256x64_ldg8_relu_exp_small_nhwc_tn_v1
                    6.68%  55.331ms       780  70.937us     736ns  1.3295ms  [CUDA memcpy HtoH]
                    6.34%  52.480ms       312  168.20us  13.952us  1.5067ms  trt_volta_h884cudnn_256x64_ldg8_relu_exp_medium_nhwc_tn_v1
                    5.02%  41.564ms        52  799.31us  789.71us  1.1940ms  void nvinfer1::tiled_pooling::poolCHW_RS3_UV2_PQT_kernel<int=4, int=4, int=32, int=1, 

How can I be sure that layer 7 is the one responsible for the 25% of GPU time that nvprof reports?
How can I find out why it didn’t use the Tensor Cores (if I understand the output correctly)?
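
One rough way to cross-check the two outputs is to compare each layer time from the profiler with the Avg column of nvprof and look for a kernel within a small tolerance. The sketch below is only a heuristic under stated assumptions: the struct names, the function names, and the 5% tolerance are all made up for illustration, and the check for "h884" in the kernel name relies on the informal naming convention that such kernels are the FP16 Tensor Core ones while "scudnn" kernels run in FP32. It is not an official TensorRT or nvprof facility.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// One line of the per-layer profiler output (time in milliseconds).
struct LayerTime {
    std::string name;
    double ms;
};

// One row of the nvprof "GPU activities" summary (average time in milliseconds).
struct KernelStat {
    std::string name;
    int calls;
    double avgMs;
};

// Return the index of the kernel whose average time is closest to the layer
// time, or -1 if no kernel is within the relative tolerance.
// The 5% default is an arbitrary choice for this sketch.
int closestKernel(const LayerTime& layer, const std::vector<KernelStat>& kernels,
                  double relTol = 0.05) {
    int best = -1;
    double bestErr = relTol;
    for (std::size_t i = 0; i < kernels.size(); ++i) {
        const double err = std::fabs(kernels[i].avgMs - layer.ms) / layer.ms;
        if (err <= bestErr) {
            bestErr = err;
            best = static_cast<int>(i);
        }
    }
    return best;
}

// Assumption: "h884" in the kernel name marks an FP16 Tensor Core kernel.
// This is an informal naming convention, not a documented API.
bool looksLikeTensorCoreKernel(const std::string& kernelName) {
    return kernelName.find("h884") != std::string::npos;
}
```

With the numbers posted above, layer 7 (4.04125ms) lands on trt_volta_scudnn_128x64_relu_medium_nn_v1 (4.0314ms average) and layer 8 (0.79664ms) lands on the pooling kernel (799.31us average). The match is only trustworthy when a kernel serves a single layer. Both of those kernels show 52 calls, which would fit one call per inference run if 52 is the number of runs, whereas the h884 kernels with 364 calls are probably shared by several layers, so their averages cannot be tied to one layer this way.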



Please check our layer-level profiler.
It can give you the estimated performance of each layer.
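
For reference, here is a minimal sketch of a per-layer profiler, assuming the reply refers to the nvinfer1::IProfiler interface and its reportLayerTime(const char* layerName, float ms) callback. So that the example compiles without TensorRT installed, it derives from a stand-in interface (IProfilerStandIn, a hypothetical name). In a real build you would include NvInfer.h, derive from nvinfer1::IProfiler instead, and register the object with IExecutionContext::setProfiler. The class name LayerTimeProfiler and its helper methods are also made up for illustration.

```cpp
#include <map>
#include <string>

// Stand-in for nvinfer1::IProfiler so the sketch builds without TensorRT.
// With TensorRT, derive from nvinfer1::IProfiler instead (recent versions
// also mark reportLayerTime as noexcept).
struct IProfilerStandIn {
    virtual void reportLayerTime(const char* layerName, float ms) = 0;
    virtual ~IProfilerStandIn() = default;
};

// Accumulates the time reported for each layer across all inference runs.
class LayerTimeProfiler : public IProfilerStandIn {
private:
    struct Entry {
        double totalMs = 0.0;
        int calls = 0;
    };
    std::map<std::string, Entry> mTotals;

public:
    // Called once per layer per inference run.
    void reportLayerTime(const char* layerName, float ms) override {
        Entry& e = mTotals[layerName];
        e.totalMs += ms;
        e.calls += 1;
    }

    // Average time of a layer in milliseconds, or 0.0 if it was never reported.
    double averageMs(const std::string& layerName) const {
        const auto it = mTotals.find(layerName);
        if (it == mTotals.end() || it->second.calls == 0) {
            return 0.0;
        }
        return it->second.totalMs / it->second.calls;
    }

    // Name of the layer with the largest average time, or "" if nothing was reported.
    std::string slowestLayer() const {
        std::string best;
        double bestMs = -1.0;
        for (const auto& kv : mTotals) {
            const double avg = kv.second.totalMs / kv.second.calls;
            if (avg > bestMs) {
                bestMs = avg;
                best = kv.first;
            }
        }
        return best;
    }
};
```

Averaging over many runs rather than printing a single run makes the per-layer numbers directly comparable to the Avg column that nvprof prints for each kernel.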