Hi,
I’ve created a simple nvinfer1::IProfiler class. When I run the network with the profiler set, layer 7 shows up as the most time-consuming layer:
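For context, a profiler of this kind can be as small as the sketch below. This is only an illustration, not the exact class used here: `LayerTimeSink` and `SimpleProfiler` are hypothetical names, and `LayerTimeSink` is a stand-in so the snippet compiles without the TensorRT headers. In real code the class would derive from nvinfer1::IProfiler, whose reportLayerTime signature can differ slightly between TensorRT versions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for nvinfer1::IProfiler (assumption: same callback shape).
struct LayerTimeSink {
    virtual void reportLayerTime(const char* layerName, float ms) = 0;
    virtual ~LayerTimeSink() = default;
};

// Records one (name, time) entry per callback, in the order the layers report.
struct SimpleProfiler : LayerTimeSink {
    std::vector<std::pair<std::string, float>> records;

    void reportLayerTime(const char* layerName, float ms) override {
        records.emplace_back(layerName, ms);
    }

    // Index of the slowest recorded layer, or -1 if nothing was recorded.
    int slowest() const {
        int best = -1;
        for (int i = 0; i < static_cast<int>(records.size()); ++i) {
            if (best < 0 || records[i].second > records[best].second) {
                best = i;
            }
        }
        return best;
    }

    // Prints the records in the same "Layer [i]: [name]: Xms" shape as the log below.
    void print() const {
        for (std::size_t i = 0; i < records.size(); ++i) {
            std::printf("Layer [%zu]: [%s]: %gms\n", i,
                        records[i].first.c_str(), records[i].second);
        }
    }
};
```

With TensorRT, an instance of such a class is attached to the execution context before running inference, and `print()` or `slowest()` is called afterwards.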
....
Layer [6]: [(Unnamed Layer* 6) [Shuffle]]: 0.14832ms
Layer [7]: [(Unnamed Layer* 7) [Convolution] + (Unnamed Layer* 9) [Activation]]: 4.04125ms
Layer [8]: [(Unnamed Layer* 10) [Pooling]]: 0.79664ms
Layer [9]: [(Unnamed Layer* 10) [Pooling] output reformatter 0]: 0.241216ms
....
nvprof shows this:
           Type  Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:   25.32%  209.63ms     52  4.0314ms  4.0190ms  4.0893ms  trt_volta_scudnn_128x64_relu_medium_nn_v1
                  19.10%  158.13ms    364  434.42us  268.55us  621.42us  trt_volta_h884cudnn_256x128_ldg8_relu_exp_medium_nhwc_tn_v1
                  15.91%  131.78ms    364  362.03us  14.368us  1.1701ms  trt_volta_h884cudnn_256x64_ldg8_relu_exp_small_nhwc_tn_v1
                   6.68%  55.331ms    780  70.937us     736ns  1.3295ms  [CUDA memcpy HtoH]
                   6.34%  52.480ms    312  168.20us  13.952us  1.5067ms  trt_volta_h884cudnn_256x64_ldg8_relu_exp_medium_nhwc_tn_v1
                   5.02%  41.564ms     52  799.31us  789.71us  1.1940ms  void nvinfer1::tiled_pooling::poolCHW_RS3_UV2_PQT_kernel<int=4, int=4, int=32, int=1,
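One rough way to cross-check the two outputs is to compare each layer's profiler time with the per-kernel average (Avg column) from nvprof. The sketch below does that with the numbers above hard-coded and converted to milliseconds. The names `Timed`, `closestKernel` and `nvprofAverages`, and the 5% tolerance used in the example, are assumptions made for illustration, not part of TensorRT or nvprof. The pooling kernel name is shortened because it is truncated in the output above.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// A named timing in milliseconds (hypothetical helper type).
struct Timed {
    std::string name;
    double ms;
};

// Returns the index of the kernel whose average time is closest to layerMs,
// or -1 if no kernel lies within relTol (relative to layerMs).
int closestKernel(double layerMs, const std::vector<Timed>& kernels, double relTol) {
    int best = -1;
    double bestDiff = 0.0;
    for (int i = 0; i < static_cast<int>(kernels.size()); ++i) {
        const double diff = std::fabs(kernels[i].ms - layerMs);
        if (diff <= relTol * layerMs && (best < 0 || diff < bestDiff)) {
            best = i;
            bestDiff = diff;
        }
    }
    return best;
}

// Avg column of the nvprof output above, converted to milliseconds.
std::vector<Timed> nvprofAverages() {
    return {
        {"trt_volta_scudnn_128x64_relu_medium_nn_v1", 4.0314},
        {"trt_volta_h884cudnn_256x128_ldg8_relu_exp_medium_nhwc_tn_v1", 0.43442},
        {"trt_volta_h884cudnn_256x64_ldg8_relu_exp_small_nhwc_tn_v1", 0.36203},
        {"trt_volta_h884cudnn_256x64_ldg8_relu_exp_medium_nhwc_tn_v1", 0.16820},
        {"nvinfer1::tiled_pooling::poolCHW_RS3_UV2_PQT_kernel", 0.79931},
    };
}
```

Calling `closestKernel(4.04125, nvprofAverages(), 0.05)` for layer 7 selects trt_volta_scudnn_128x64_relu_medium_nn_v1, and the pooling layer time of 0.79664 selects the pooling kernel. A match on timing alone is only circumstantial, and layers with similar durations would be ambiguous.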
How can I know for sure that this layer (7) is the one accounting for the 25% of the time shown by nvprof?
How can I find out why it didn’t use the Tensor Cores (if I understand the output correctly)?
Thanks,
Eyal