Trtexec profile

Description

Hi, all.

I have a few questions about the logs from trtexec.

  1. Iterations
[08/09/2020-06:24:39] [I] === Profile (490 iterations ) ===
[08/09/2020-06:24:39] [I]          Layer   Time (ms)   Avg. Time (ms)   Time %
[08/09/2020-06:24:39] [I]  conv1 + relu1      217.14             0.44      6.9
[08/09/2020-06:24:39] [I]          norm1       44.27             0.09      1.4
[08/09/2020-06:24:39] [I]          pool1       19.21             0.04      0.6
[08/09/2020-06:24:39] [I]  conv2 + relu2      450.90             0.92     14.4
[08/09/2020-06:24:39] [I]          norm2       95.52             0.19      3.0
[08/09/2020-06:24:39] [I]          pool2       14.53             0.03      0.5
[08/09/2020-06:24:39] [I]  conv3 + relu3      186.97             0.38      6.0
[08/09/2020-06:24:39] [I]  conv4 + relu4      182.38             0.37      5.8
[08/09/2020-06:24:39] [I]  conv5 + relu5      102.50             0.21      3.3
[08/09/2020-06:24:39] [I]          pool5        6.21             0.01      0.2
[08/09/2020-06:24:39] [I]    fc6 + relu6     1146.71             2.34     36.5
[08/09/2020-06:24:39] [I]    fc7 + relu7      515.00             1.05     16.4
[08/09/2020-06:24:39] [I]            fc8      152.44             0.31      4.9
[08/09/2020-06:24:39] [I]           prob        7.73             0.02      0.2
[08/09/2020-06:24:39] [I]          Total     3141.50             6.41    100.0

I set --iterations=50 when I ran trtexec.

sudo trtexec --deploy=data/caffe_model/alexnet/deploy.prototxt --output=prob --iterations=50 --allowGPUFallback --dumpProfile

However, the profile above says it ran 490 iterations. Why are the two iteration counts (50 on the command line vs. 490 in the profile) different?

  2. Layers running on DLA/GPU
    I ran the same model (resnet50) with different options.
    Example1) sudo trtexec --deploy=data/caffe_model/resnet50/deploy.prototxt --output=prob --useDLACore=0 --iterations=50 --allowGPUFallback --dumpProfile
[08/09/2020-08:16:02] [08/09/2020-08:16:04] [I] [TRT] 
[08/09/2020-08:16:04] [I] [TRT] --------------- Layers running on DLA: 
[08/09/2020-08:16:04] [I] [TRT] {conv1,bn_conv1,scale_conv1,conv1_relu,pool1,res2a_branch1,bn2a_branch1,scale2a_branch1,res2a_branch2a,bn2a_branch2a,scale2a_branch2a,res2a_branch2a_relu,res2a_branch2b,bn2a_branch2b,scale2a_branch2b,res2a_branch2b_relu,res2a_branch2c,bn2a_branch2c,scale2a_branch2c,res2a,res2a_relu,res2b_branch2a,bn2b_branch2a,scale2b_branch2a,res2b_branch2a_relu,res2b_branch2b,bn2b_branch2b,scale2b_branch2b,res2b_branch2b_relu,res2b_branch2c,bn2b_branch2c,scale2b_branch2c,res2b,res2b_relu,res2c_branch2a,bn2c_branch2a,scale2c_branch2a,res2c_branch2a_relu,res2c_branch2b,bn2c_branch2b,scale2c_branch2b,res2c_branch2b_relu,res2c_branch2c,bn2c_branch2c,scale2c_branch2c,res2c,res2c_relu,res3a_branch1,bn3a_branch1,scale3a_branch1,res3a_branch2a,bn3a_branch2a,scale3a_branch2a,res3a_branch2a_relu,res3a_branch2b,bn3a_branch2b,scale3a_branch2b,res3a_branch2b_relu,res3a_branch2c,bn3a_branch2c,scale3a_branch2c,res3a,res3a_relu,res3b_branch2a,bn3b_branch2a,scale3b_branch2a,res3b_branch2a_relu,res3b_branch2b,bn3b_branch2b,scale3b_branch2b,res3b_branch2b_relu,res3b_branch2c,bn3b_branch2c,scale3b_branch2c,res3b,res3b_relu,res3c_branch2a,bn3c_branch2a,scale3c_branch2a,res3c_branch2a_relu,res3c_branch2b,bn3c_branch2b,scale3c_branch2b,res3c_branch2b_relu,res3c_branch2c,bn3c_branch2c,scale3c_branch2c,res3c,res3c_relu,res3d_branch2a,bn3d_branch2a,scale3d_branch2a,res3d_branch2a_relu,res3d_branch2b,bn3d_branch2b,scale3d_branch2b,res3d_branch2b_relu,res3d_branch2c,bn3d_branch2c,scale3d_branch2c,res3d,res3d_relu,res4a_branch1,bn4a_branch1,scale4a_branch1,res4a_branch2a,bn4a_branch2a,scale4a_branch2a,res4a_branch2a_relu,res4a_branch2b,bn4a_branch2b,scale4a_branch2b,res4a_branch2b_relu,res4a_branch2c,bn4a_branch2c,scale4a_branch2c,res4a,res4a_relu,res4b_branch2a,bn4b_branch2a,scale4b_branch2a,res4b_branch2a_relu,res4b_branch2b,bn4b_branch2b,scale4b_branch2b,res4b_branch2b_relu,res4b_branch2c,bn4b_branch2c,scale4b_branch2c,res4b,res4b_relu,res4c_branch2a,bn4c_branch2a,scale4c_branch2a,res4c_branch2a_relu,res4c_branch2b,bn4c_branch2b,scale4c_branch2b,res4c_branch2b_relu,res4c_branch2c,bn4c_branch2c,scale4c_branch2c,res4c,res4c_relu,res4d_branch2a,bn4d_branch2a,scale4d_branch2a,res4d_branch2a_relu,res4d_branch2b,bn4d_branch2b,scale4d_branch2b,res4d_branch2b_relu,res4d_branch2c,bn4d_branch2c,scale4d_branch2c,res4d,res4d_relu,res4e_branch2a,bn4e_branch2a,scale4e_branch2a,res4e_branch2a_relu,res4e_branch2b,bn4e_branch2b,scale4e_branch2b,res4e_branch2b_relu,res4e_branch2c,bn4e_branch2c,scale4e_branch2c,res4e,res4e_relu,res4f_branch2a,bn4f_branch2a,scale4f_branch2a,res4f_branch2a_relu,res4f_branch2b,bn4f_branch2b,scale4f_branch2b,res4f_branch2b_relu,res4f_branch2c,bn4f_branch2c,scale4f_branch2c,res4f,res4f_relu,res5a_branch1,bn5a_branch1,scale5a_branch1,res5a_branch2a,bn5a_branch2a,scale5a_branch2a,res5a_branch2a_relu,res5a_branch2b,bn5a_branch2b,scale5a_branch2b,res5a_branch2b_relu,res5a_branch2c,bn5a_branch2c,scale5a_branch2c,res5a,res5a_relu,res5b_branch2a,bn5b_branch2a,scale5b_branch2a,res5b_branch2a_relu,res5b_branch2b,bn5b_branch2b,scale5b_branch2b,res5b_branch2b_relu,res5b_branch2c,bn5b_branch2c,scale5b_branch2c,res5b,res5b_relu,res5c_branch2a,bn5c_branch2a,scale5c_branch2a,res5c_branch2a_relu,res5c_branch2b,bn5c_branch2b,scale5c_branch2b,res5c_branch2b_relu,res5c_branch2c,bn5c_branch2c,scale5c_branch2c,res5c,res5c_relu,pool5,fc1000}, 
[08/09/2020-08:16:04] [I] [TRT] --------------- Layers running on GPU: 
[08/09/2020-08:16:04] [I] [TRT] prob, 

Example2) sudo trtexec --deploy=data/caffe_model/resnet50/deploy.prototxt --output=prob --iterations=50 --allowGPUFallback --dumpProfile

[08/09/2020-08:15:05] [I] [TRT] --------------- Layers running on DLA: 
[08/09/2020-08:15:05] [I] [TRT] 
[08/09/2020-08:15:05] [I] [TRT] --------------- Layers running on GPU: 
[08/09/2020-08:15:05] [I] [TRT] conv1 + conv1_relu, pool1, res2a_branch2a + res2a_branch2a_relu, res2a_branch2b + res2a_branch2b_relu, res2a_branch2c, res2a_branch1 + res2a + res2a_relu, res2b_branch2a + res2b_branch2a_relu, res2b_branch2b + res2b_branch2b_relu, res2b_branch2c + res2b + res2b_relu, res2c_branch2a + res2c_branch2a_relu, res2c_branch2b + res2c_branch2b_relu, res2c_branch2c + res2c + res2c_relu, res3a_branch2a + res3a_branch2a_relu, res3a_branch2b + res3a_branch2b_relu, res3a_branch2c, res3a_branch1 + res3a + res3a_relu, res3b_branch2a + res3b_branch2a_relu, res3b_branch2b + res3b_branch2b_relu, res3b_branch2c + res3b + res3b_relu, res3c_branch2a + res3c_branch2a_relu, res3c_branch2b + res3c_branch2b_relu, res3c_branch2c + res3c + res3c_relu, res3d_branch2a + res3d_branch2a_relu, res3d_branch2b + res3d_branch2b_relu, res3d_branch2c + res3d + res3d_relu, res4a_branch2a + res4a_branch2a_relu, res4a_branch2b + res4a_branch2b_relu, res4a_branch2c, res4a_branch1 + res4a + res4a_relu, res4b_branch2a + res4b_branch2a_relu, res4b_branch2b + res4b_branch2b_relu, res4b_branch2c + res4b + res4b_relu, res4c_branch2a + res4c_branch2a_relu, res4c_branch2b + res4c_branch2b_relu, res4c_branch2c + res4c + res4c_relu, res4d_branch2a + res4d_branch2a_relu, res4d_branch2b + res4d_branch2b_relu, res4d_branch2c + res4d + res4d_relu, res4e_branch2a + res4e_branch2a_relu, res4e_branch2b + res4e_branch2b_relu, res4e_branch2c + res4e + res4e_relu, res4f_branch2a + res4f_branch2a_relu, res4f_branch2b + res4f_branch2b_relu, res4f_branch2c + res4f + res4f_relu, res5a_branch2a + res5a_branch2a_relu, res5a_branch2b + res5a_branch2b_relu, res5a_branch2c, res5a_branch1 + res5a + res5a_relu, res5b_branch2a + res5b_branch2a_relu, res5b_branch2b + res5b_branch2b_relu, res5b_branch2c + res5b + res5b_relu, res5c_branch2a + res5c_branch2a_relu, res5c_branch2b + res5c_branch2b_relu, res5c_branch2c + res5c + res5c_relu, pool5, fc1000, prob, 
[08/09/2020-08:15:13] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
  • Example1 and Example2 run the same model (resnet50), but their layer compositions are different: Example1 has bn_conv1 but Example2 does not. I am wondering why the layer composition differs depending on the device type.
  • What is the meaning of ‘{}’ in Example1, and the meaning of ‘+’ in Example2?
  3. Profile on DLA
  • I got the execution time of each layer when I ran trtexec on the GPU.
  • sudo trtexec --deploy=data/caffe_model/alexnet/deploy.prototxt --output=prob --iterations=50 --allowGPUFallback --dumpProfile
[08/09/2020-06:24:39] [I] === Profile (490 iterations ) ===
[08/09/2020-06:24:39] [I]          Layer   Time (ms)   Avg. Time (ms)   Time %
[08/09/2020-06:24:39] [I]  conv1 + relu1      217.14             0.44      6.9
[08/09/2020-06:24:39] [I]          norm1       44.27             0.09      1.4
[08/09/2020-06:24:39] [I]          pool1       19.21             0.04      0.6
[08/09/2020-06:24:39] [I]  conv2 + relu2      450.90             0.92     14.4
[08/09/2020-06:24:39] [I]          norm2       95.52             0.19      3.0
[08/09/2020-06:24:39] [I]          pool2       14.53             0.03      0.5
[08/09/2020-06:24:39] [I]  conv3 + relu3      186.97             0.38      6.0
[08/09/2020-06:24:39] [I]  conv4 + relu4      182.38             0.37      5.8
[08/09/2020-06:24:39] [I]  conv5 + relu5      102.50             0.21      3.3
[08/09/2020-06:24:39] [I]          pool5        6.21             0.01      0.2
[08/09/2020-06:24:39] [I]    fc6 + relu6     1146.71             2.34     36.5
[08/09/2020-06:24:39] [I]    fc7 + relu7      515.00             1.05     16.4
[08/09/2020-06:24:39] [I]            fc8      152.44             0.31      4.9
[08/09/2020-06:24:39] [I]           prob        7.73             0.02      0.2
[08/09/2020-06:24:39] [I]          Total     3141.50             6.41    100.0
  • However, when I ran trtexec on the DLA, it output the following profile.
  • sudo trtexec --deploy=data/caffe_model/alexnet/deploy.prototxt --output=prob --useDLACore=0 --iterations=50 --allowGPUFallback --dumpProfile
[08/09/2020-06:25:03] [I]                                                                                                                                                  Layer   Time (ms)   Avg. Time (ms)   Time %
[08/09/2020-06:25:03] [I]                                                                                                                                            data to nvm       39.47             0.12      1.3
[08/09/2020-06:25:03] [I]                                    {conv1,relu1,norm1,pool1,conv2,relu2,norm2,pool2,conv3,relu3,conv4,relu4,conv5,relu5,pool5,fc6,relu6,fc7,relu7,fc8}      126.55             0.38      4.1
[08/09/2020-06:25:03] [I]                                                                                                                                       data copy finish       21.55             0.06      0.7
[08/09/2020-06:25:03] [I]               {conv1,relu1,norm1,pool1,conv2,relu2,norm2,pool2,conv3,relu3,conv4,relu4,conv5,relu5,pool5,fc6,relu6,fc7,relu7,fc8} output reformatter 0     2914.08             8.75     93.7
[08/09/2020-06:25:03] [I]  {conv1,relu1,norm1,pool1,conv2,relu2,norm2,pool2,conv3,relu3,conv4,relu4,conv5,relu5,pool5,fc6,relu6,fc7,relu7,fc8} output to be reformatted 0 finish        1.92             0.01      0.1
[08/09/2020-06:25:03] [I]                                                                                                                                                   prob        4.68             0.01      0.2
[08/09/2020-06:25:03] [I]                                                                                                                              prob output reformatter 0        3.15             0.01      0.1
[08/09/2020-06:25:03] [I]                                                                                                                                                  Total     3111.40             9.34    100.0
  • Why does it not show the execution time of every layer? (Or, how can I record the execution time per layer?)
  • What is the exact meaning of ‘data to nvm’, ‘data copy finish’, ‘output reformatter 0’, and ‘output to be reformatted 0 finish’?
  • Is there any detailed document on how the DLA works?

Any help would be greatly appreciated.

Regards.

Environment

TensorRT Version : 7.1.0
nvidia@xavier:~/tf_to_trt_image_classification$ dpkg -l | grep TensorRT

ii  graphsurgeon-tf                               7.1.0-1+cuda10.2                                 arm64        GraphSurgeon for TensorRT package
ii  libnvinfer-bin                                7.1.0-1+cuda10.2                                 arm64        TensorRT binaries
ii  libnvinfer-dev                                7.1.0-1+cuda10.2                                 arm64        TensorRT development libraries and headers
ii  libnvinfer-doc                                7.1.0-1+cuda10.2                                 all          TensorRT documentation
ii  libnvinfer-plugin-dev                         7.1.0-1+cuda10.2                                 arm64        TensorRT plugin libraries
ii  libnvinfer-plugin7                            7.1.0-1+cuda10.2                                 arm64        TensorRT plugin libraries
ii  libnvinfer-samples                            7.1.0-1+cuda10.2                                 all          TensorRT samples
ii  libnvinfer7                                   7.1.0-1+cuda10.2                                 arm64        TensorRT runtime libraries
ii  libnvonnxparsers-dev                          7.1.0-1+cuda10.2                                 arm64        TensorRT ONNX libraries
ii  libnvonnxparsers7                             7.1.0-1+cuda10.2                                 arm64        TensorRT ONNX libraries
ii  libnvparsers-dev                              7.1.0-1+cuda10.2                                 arm64        TensorRT parsers libraries
ii  libnvparsers7                                 7.1.0-1+cuda10.2                                 arm64        TensorRT parsers libraries
ii  nvidia-container-csv-tensorrt                 7.1.0.16-1+cuda10.2                              arm64        Jetpack TensorRT CSV file
ii  python-libnvinfer                             7.1.0-1+cuda10.2                                 arm64        Python bindings for TensorRT
ii  python-libnvinfer-dev                         7.1.0-1+cuda10.2                                 arm64        Python development package for TensorRT
ii  python3-libnvinfer                            7.1.0-1+cuda10.2                                 arm64        Python 3 bindings for TensorRT
ii  python3-libnvinfer-dev                        7.1.0-1+cuda10.2                                 arm64        Python 3 development package for TensorRT
ii  tensorrt                                      7.1.0.16-1+cuda10.2                              arm64        Meta package of TensorRT
ii  uff-converter-tf                              7.1.0-1+cuda10.2                                 arm64        UFF converter for TensorRT package

GPU Type : 512-Core NVIDIA Volta @ 1377MHz
Nvidia Driver Version : ?
CUDA Version : V10.2.89

nvidia@xavier:~/tf_to_trt_image_classification$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_21:14:42_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

CUDNN Version : 8.0.0

nvidia@xavier:~/tf_to_trt_image_classification$ dpkg --list | grep libcudnn
ii  libcudnn8                                     8.0.0.145-1+cuda10.2                             arm64        cuDNN runtime libraries
ii  libcudnn8-dev                                 8.0.0.145-1+cuda10.2                             arm64        cuDNN development libraries and headers
ii  libcudnn8-doc                                 8.0.0.145-1+cuda10.2                             arm64        cuDNN documents and samples

Operating System + Version : Ubuntu 18.04
Python Version (if applicable) : 3.6.9
TensorFlow Version (if applicable) : 1.15.2

Hi @yjkim2,
Please refer to the links below to understand DLA better.
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#dla_topic
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#restrictions-with-dla

Thanks!

Thanks for your reply.

I just checked the URLs you gave me.

They contain useful information about the API for using DLA with TensorRT.

However, what I really want to know is the workflow of the DLA during inference, so that
I can work out the meaning of ‘data to nvm’, ‘data copy finish’, ‘output reformatter 0’, and ‘output to be reformatted 0 finish’ in the profile.

Sorry for the unclear question.

Any help would be much appreciated. :)

Regards.

Also, I would like to know how to measure the execution time of each layer when I use the DLA.
I used the --dumpProfile option to see the per-layer execution times, but it only shows the overall execution time.
(Please refer to section 3, “Profile on DLA”, in the post above.)

Any advice on this would be appreciated.

Thanks in advance.

Hi @yjkim2

Set --duration to zero to avoid running more iterations than requested. trtexec runs for at least --iterations iterations and for at least --duration seconds (3 seconds by default), so a fast model keeps iterating until the duration elapses; that is why the profile reports 490 iterations instead of 50.
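
For example, appending --duration=0 to the original command should make the reported iteration count match the requested 50 (a sketch, with all other flags kept exactly as in your command):

sudo trtexec --deploy=data/caffe_model/alexnet/deploy.prototxt --output=prob --iterations=50 --duration=0 --allowGPUFallback --dumpProfile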

The ‘{}’ bracket means those nodes are folded into a single DLA node. ‘+’ means two layers are fused into one layer. The reason you did not see bn_conv1 is that bn_conv1 is also fused, but we do not seem to add its name.

The DLA subgraph is a single node from TensorRT’s point of view, so we cannot profile the DLA execution layer by layer. The ‘xxx finish’ entries are finishNvmRegionLayer; you do not need to care about those layers. They are there to make the DLA work correctly.
The ‘xxx copy’ and ‘xxx reformatter’ entries are reformat nodes between the GPU region and the DLA NVM region.
The first output reformatter time is DLA time + reformat time. The reformat time is quite short compared with the DLA time, so you can treat that time as the real cost of the DLA.
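
For example, applying this to the DLA profile posted above: the ‘{conv1,…,fc8} output reformatter 0’ entry averages 8.75 ms per iteration, while the ‘data to nvm’, ‘data copy finish’, and ‘… finish’ entries each average roughly 0.1 ms or less, so nearly all of that 8.75 ms per inference can be read as the cost of the DLA subgraph itself.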

Thanks!

Thank you for your detailed explanation!

I have one more thing to ask.

You said, “The first output reformatter time is DLA time + reformat time. The reformat time is quite short compared with the DLA time, so you can treat that time as the real cost of the DLA.”

Then, what does the DLA do during the DLA-time portion of the first output reformatter?

Thanks!