Description
Hi all,
I have a few questions about the logs from trtexec.
- Iterations
[08/09/2020-06:24:39] [I] === Profile (490 iterations ) ===
[08/09/2020-06:24:39] [I] Layer Time (ms) Avg. Time (ms) Time %
[08/09/2020-06:24:39] [I] conv1 + relu1 217.14 0.44 6.9
[08/09/2020-06:24:39] [I] norm1 44.27 0.09 1.4
[08/09/2020-06:24:39] [I] pool1 19.21 0.04 0.6
[08/09/2020-06:24:39] [I] conv2 + relu2 450.90 0.92 14.4
[08/09/2020-06:24:39] [I] norm2 95.52 0.19 3.0
[08/09/2020-06:24:39] [I] pool2 14.53 0.03 0.5
[08/09/2020-06:24:39] [I] conv3 + relu3 186.97 0.38 6.0
[08/09/2020-06:24:39] [I] conv4 + relu4 182.38 0.37 5.8
[08/09/2020-06:24:39] [I] conv5 + relu5 102.50 0.21 3.3
[08/09/2020-06:24:39] [I] pool5 6.21 0.01 0.2
[08/09/2020-06:24:39] [I] fc6 + relu6 1146.71 2.34 36.5
[08/09/2020-06:24:39] [I] fc7 + relu7 515.00 1.05 16.4
[08/09/2020-06:24:39] [I] fc8 152.44 0.31 4.9
[08/09/2020-06:24:39] [I] prob 7.73 0.02 0.2
[08/09/2020-06:24:39] [I] Total 3141.50 6.41 100.0
I set --iterations=50 when I ran trtexec.
sudo trtexec --deploy=data/caffe_model/alexnet/deploy.prototxt --output=prob --iterations=50 --allowGPUFallback --dumpProfile
However, the profile above says it ran 490 iterations. Why are the two iteration counts different (50 on the command line vs. 490 in the profile)?
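One thing I did notice: the numbers in the table are at least self-consistent with 490 runs. Dividing the reported total time by the average time recovers the iteration count (just a sanity check on the table, not an explanation of where 490 comes from):

```python
# Sanity check on the "Total" row of the profile above:
# Time (ms) / Avg. Time (ms) should recover the iteration count.
total_ms = 3141.50  # "Total" row, Time (ms)
avg_ms = 6.41       # "Total" row, Avg. Time (ms)

iterations = round(total_ms / avg_ms)
print(iterations)  # 490, matching "=== Profile (490 iterations) ==="
```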
- Layers running on DLA/GPU
I ran the same model (resnet50) with two different sets of options.
Example 1) sudo trtexec --deploy=data/caffe_model/resnet50/deploy.prototxt --output=prob --useDLACore=0 --iterations=50 --allowGPUFallback --dumpProfile
[08/09/2020-08:16:04] [I] [TRT]
[08/09/2020-08:16:04] [I] [TRT] --------------- Layers running on DLA:
[08/09/2020-08:16:04] [I] [TRT] {conv1,bn_conv1,scale_conv1,conv1_relu,pool1,res2a_branch1,bn2a_branch1,scale2a_branch1,res2a_branch2a,bn2a_branch2a,scale2a_branch2a,res2a_branch2a_relu,res2a_branch2b,bn2a_branch2b,scale2a_branch2b,res2a_branch2b_relu,res2a_branch2c,bn2a_branch2c,scale2a_branch2c,res2a,res2a_relu,res2b_branch2a,bn2b_branch2a,scale2b_branch2a,res2b_branch2a_relu,res2b_branch2b,bn2b_branch2b,scale2b_branch2b,res2b_branch2b_relu,res2b_branch2c,bn2b_branch2c,scale2b_branch2c,res2b,res2b_relu,res2c_branch2a,bn2c_branch2a,scale2c_branch2a,res2c_branch2a_relu,res2c_branch2b,bn2c_branch2b,scale2c_branch2b,res2c_branch2b_relu,res2c_branch2c,bn2c_branch2c,scale2c_branch2c,res2c,res2c_relu,res3a_branch1,bn3a_branch1,scale3a_branch1,res3a_branch2a,bn3a_branch2a,scale3a_branch2a,res3a_branch2a_relu,res3a_branch2b,bn3a_branch2b,scale3a_branch2b,res3a_branch2b_relu,res3a_branch2c,bn3a_branch2c,scale3a_branch2c,res3a,res3a_relu,res3b_branch2a,bn3b_branch2a,scale3b_branch2a,res3b_branch2a_relu,res3b_branch2b,bn3b_branch2b,scale3b_branch2b,res3b_branch2b_relu,res3b_branch2c,bn3b_branch2c,scale3b_branch2c,res3b,res3b_relu,res3c_branch2a,bn3c_branch2a,scale3c_branch2a,res3c_branch2a_relu,res3c_branch2b,bn3c_branch2b,scale3c_branch2b,res3c_branch2b_relu,res3c_branch2c,bn3c_branch2c,scale3c_branch2c,res3c,res3c_relu,res3d_branch2a,bn3d_branch2a,scale3d_branch2a,res3d_branch2a_relu,res3d_branch2b,bn3d_branch2b,scale3d_branch2b,res3d_branch2b_relu,res3d_branch2c,bn3d_branch2c,scale3d_branch2c,res3d,res3d_relu,res4a_branch1,bn4a_branch1,scale4a_branch1,res4a_branch2a,bn4a_branch2a,scale4a_branch2a,res4a_branch2a_relu,res4a_branch2b,bn4a_branch2b,scale4a_branch2b,res4a_branch2b_relu,res4a_branch2c,bn4a_branch2c,scale4a_branch2c,res4a,res4a_relu,res4b_branch2a,bn4b_branch2a,scale4b_branch2a,res4b_branch2a_relu,res4b_branch2b,bn4b_branch2b,scale4b_branch2b,res4b_branch2b_relu,res4b_branch2c,bn4b_branch2c,scale4b_branch2c,res4b,res4b_relu,res4c_branch2a,bn4c_branch2a,scale4c_branch2a,res4c_branch2a_relu,res4c_branch2b,bn4c_branch2b,scale4c_branch2b,res4c_branch2b_relu,res4c_branch2c,bn4c_branch2c,scale4c_branch2c,res4c,res4c_relu,res4d_branch2a,bn4d_branch2a,scale4d_branch2a,res4d_branch2a_relu,res4d_branch2b,bn4d_branch2b,scale4d_branch2b,res4d_branch2b_relu,res4d_branch2c,bn4d_branch2c,scale4d_branch2c,res4d,res4d_relu,res4e_branch2a,bn4e_branch2a,scale4e_branch2a,res4e_branch2a_relu,res4e_branch2b,bn4e_branch2b,scale4e_branch2b,res4e_branch2b_relu,res4e_branch2c,bn4e_branch2c,scale4e_branch2c,res4e,res4e_relu,res4f_branch2a,bn4f_branch2a,scale4f_branch2a,res4f_branch2a_relu,res4f_branch2b,bn4f_branch2b,scale4f_branch2b,res4f_branch2b_relu,res4f_branch2c,bn4f_branch2c,scale4f_branch2c,res4f,res4f_relu,res5a_branch1,bn5a_branch1,scale5a_branch1,res5a_branch2a,bn5a_branch2a,scale5a_branch2a,res5a_branch2a_relu,res5a_branch2b,bn5a_branch2b,scale5a_branch2b,res5a_branch2b_relu,res5a_branch2c,bn5a_branch2c,scale5a_branch2c,res5a,res5a_relu,res5b_branch2a,bn5b_branch2a,scale5b_branch2a,res5b_branch2a_relu,res5b_branch2b,bn5b_branch2b,scale5b_branch2b,res5b_branch2b_relu,res5b_branch2c,bn5b_branch2c,scale5b_branch2c,res5b,res5b_relu,res5c_branch2a,bn5c_branch2a,scale5c_branch2a,res5c_branch2a_relu,res5c_branch2b,bn5c_branch2b,scale5c_branch2b,res5c_branch2b_relu,res5c_branch2c,bn5c_branch2c,scale5c_branch2c,res5c,res5c_relu,pool5,fc1000},
[08/09/2020-08:16:04] [I] [TRT] --------------- Layers running on GPU:
[08/09/2020-08:16:04] [I] [TRT] prob,
Example 2) sudo trtexec --deploy=data/caffe_model/resnet50/deploy.prototxt --output=prob --iterations=50 --allowGPUFallback --dumpProfile
[08/09/2020-08:15:05] [I] [TRT] --------------- Layers running on DLA:
[08/09/2020-08:15:05] [I] [TRT]
[08/09/2020-08:15:05] [I] [TRT] --------------- Layers running on GPU:
[08/09/2020-08:15:05] [I] [TRT] conv1 + conv1_relu, pool1, res2a_branch2a + res2a_branch2a_relu, res2a_branch2b + res2a_branch2b_relu, res2a_branch2c, res2a_branch1 + res2a + res2a_relu, res2b_branch2a + res2b_branch2a_relu, res2b_branch2b + res2b_branch2b_relu, res2b_branch2c + res2b + res2b_relu, res2c_branch2a + res2c_branch2a_relu, res2c_branch2b + res2c_branch2b_relu, res2c_branch2c + res2c + res2c_relu, res3a_branch2a + res3a_branch2a_relu, res3a_branch2b + res3a_branch2b_relu, res3a_branch2c, res3a_branch1 + res3a + res3a_relu, res3b_branch2a + res3b_branch2a_relu, res3b_branch2b + res3b_branch2b_relu, res3b_branch2c + res3b + res3b_relu, res3c_branch2a + res3c_branch2a_relu, res3c_branch2b + res3c_branch2b_relu, res3c_branch2c + res3c + res3c_relu, res3d_branch2a + res3d_branch2a_relu, res3d_branch2b + res3d_branch2b_relu, res3d_branch2c + res3d + res3d_relu, res4a_branch2a + res4a_branch2a_relu, res4a_branch2b + res4a_branch2b_relu, res4a_branch2c, res4a_branch1 + res4a + res4a_relu, res4b_branch2a + res4b_branch2a_relu, res4b_branch2b + res4b_branch2b_relu, res4b_branch2c + res4b + res4b_relu, res4c_branch2a + res4c_branch2a_relu, res4c_branch2b + res4c_branch2b_relu, res4c_branch2c + res4c + res4c_relu, res4d_branch2a + res4d_branch2a_relu, res4d_branch2b + res4d_branch2b_relu, res4d_branch2c + res4d + res4d_relu, res4e_branch2a + res4e_branch2a_relu, res4e_branch2b + res4e_branch2b_relu, res4e_branch2c + res4e + res4e_relu, res4f_branch2a + res4f_branch2a_relu, res4f_branch2b + res4f_branch2b_relu, res4f_branch2c + res4f + res4f_relu, res5a_branch2a + res5a_branch2a_relu, res5a_branch2b + res5a_branch2b_relu, res5a_branch2c, res5a_branch1 + res5a + res5a_relu, res5b_branch2a + res5b_branch2a_relu, res5b_branch2b + res5b_branch2b_relu, res5b_branch2c + res5b + res5b_relu, res5c_branch2a + res5c_branch2a_relu, res5c_branch2b + res5c_branch2b_relu, res5c_branch2c + res5c + res5c_relu, pool5, fc1000, prob,
[08/09/2020-08:15:13] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
- Example 1 and Example 2 run the same model (resnet50), but their layer compositions are different: Example 1 has bn_conv1, while Example 2 does not. I am wondering why the layer composition differs depending on the device type.
- What is the meaning of ‘{}’ in Example 1, and of ‘+’ in Example 2?
- Profile on DLA
- I got per-layer execution times when I ran trtexec on the GPU.
sudo trtexec --deploy=data/caffe_model/alexnet/deploy.prototxt --output=prob --iterations=50 --allowGPUFallback --dumpProfile
[08/09/2020-06:24:39] [I] === Profile (490 iterations ) ===
[08/09/2020-06:24:39] [I] Layer Time (ms) Avg. Time (ms) Time %
[08/09/2020-06:24:39] [I] conv1 + relu1 217.14 0.44 6.9
[08/09/2020-06:24:39] [I] norm1 44.27 0.09 1.4
[08/09/2020-06:24:39] [I] pool1 19.21 0.04 0.6
[08/09/2020-06:24:39] [I] conv2 + relu2 450.90 0.92 14.4
[08/09/2020-06:24:39] [I] norm2 95.52 0.19 3.0
[08/09/2020-06:24:39] [I] pool2 14.53 0.03 0.5
[08/09/2020-06:24:39] [I] conv3 + relu3 186.97 0.38 6.0
[08/09/2020-06:24:39] [I] conv4 + relu4 182.38 0.37 5.8
[08/09/2020-06:24:39] [I] conv5 + relu5 102.50 0.21 3.3
[08/09/2020-06:24:39] [I] pool5 6.21 0.01 0.2
[08/09/2020-06:24:39] [I] fc6 + relu6 1146.71 2.34 36.5
[08/09/2020-06:24:39] [I] fc7 + relu7 515.00 1.05 16.4
[08/09/2020-06:24:39] [I] fc8 152.44 0.31 4.9
[08/09/2020-06:24:39] [I] prob 7.73 0.02 0.2
[08/09/2020-06:24:39] [I] Total 3141.50 6.41 100.0
- However, when I ran trtexec on the DLA, it output the following profile.
sudo trtexec --deploy=data/caffe_model/alexnet/deploy.prototxt --output=prob --useDLACore=0 --iterations=50 --allowGPUFallback --dumpProfile
[08/09/2020-06:25:03] [I] Layer Time (ms) Avg. Time (ms) Time %
[08/09/2020-06:25:03] [I] data to nvm 39.47 0.12 1.3
[08/09/2020-06:25:03] [I] {conv1,relu1,norm1,pool1,conv2,relu2,norm2,pool2,conv3,relu3,conv4,relu4,conv5,relu5,pool5,fc6,relu6,fc7,relu7,fc8} 126.55 0.38 4.1
[08/09/2020-06:25:03] [I] data copy finish 21.55 0.06 0.7
[08/09/2020-06:25:03] [I] {conv1,relu1,norm1,pool1,conv2,relu2,norm2,pool2,conv3,relu3,conv4,relu4,conv5,relu5,pool5,fc6,relu6,fc7,relu7,fc8} output reformatter 0 2914.08 8.75 93.7
[08/09/2020-06:25:03] [I] {conv1,relu1,norm1,pool1,conv2,relu2,norm2,pool2,conv3,relu3,conv4,relu4,conv5,relu5,pool5,fc6,relu6,fc7,relu7,fc8} output to be reformatted 0 finish 1.92 0.01 0.1
[08/09/2020-06:25:03] [I] prob 4.68 0.01 0.2
[08/09/2020-06:25:03] [I] prob output reformatter 0 3.15 0.01 0.1
[08/09/2020-06:25:03] [I] Total 3111.40 9.34 100.0
- Why does it not show the execution time for each layer? (Or how can I record per-layer execution times?)
- What is the exact meaning of ‘data to nvm’, ‘data copy finish’, ‘output reformatter 0’, and ‘output to be reformatted 0 finish’?
- Or is there any detailed documentation on how the DLA works?
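In case it helps anyone reproduce the comparison, this is the small script I use to pull the per-layer rows out of the --dumpProfile output shown above. It assumes only the `[I] <layer> <time> <avg> <percent>` line format from these logs, nothing about trtexec internals:

```python
import re

# Matches profile rows of the form (taken from the logs above):
# [08/09/2020-06:24:39] [I] conv1 + relu1 217.14 0.44 6.9
PROFILE_RE = re.compile(
    r"\[I\]\s+(?P<layer>.+?)\s+(?P<total>\d+\.\d+)\s+(?P<avg>\d+\.\d+)\s+(?P<pct>\d+\.\d+)$"
)

def parse_profile(lines):
    """Return a list of (layer, total_ms, avg_ms, percent) tuples."""
    rows = []
    for line in lines:
        m = PROFILE_RE.search(line)
        if m:  # header lines like "Layer Time (ms) ..." simply don't match
            rows.append((m.group("layer"),
                         float(m.group("total")),
                         float(m.group("avg")),
                         float(m.group("pct"))))
    return rows

# Example with a few rows copied from the GPU profile above:
log = [
    "[08/09/2020-06:24:39] [I] conv1 + relu1 217.14 0.44 6.9",
    "[08/09/2020-06:24:39] [I] fc6 + relu6 1146.71 2.34 36.5",
    "[08/09/2020-06:24:39] [I] Total 3141.50 6.41 100.0",
]
for layer, total, avg, pct in parse_profile(log):
    print(f"{layer:20s} {avg:6.2f} ms/iter ({pct:5.1f}%)")
```

This makes it easy to diff the GPU table against the fused DLA rows side by side.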
Any help would be much appreciated.
Regards.
Environment
TensorRT Version : 7.1.0
nvidia@xavier:~/tf_to_trt_image_classification$ dpkg -l | grep TensorRT
ii graphsurgeon-tf 7.1.0-1+cuda10.2 arm64 GraphSurgeon for TensorRT package
ii libnvinfer-bin 7.1.0-1+cuda10.2 arm64 TensorRT binaries
ii libnvinfer-dev 7.1.0-1+cuda10.2 arm64 TensorRT development libraries and headers
ii libnvinfer-doc 7.1.0-1+cuda10.2 all TensorRT documentation
ii libnvinfer-plugin-dev 7.1.0-1+cuda10.2 arm64 TensorRT plugin libraries
ii libnvinfer-plugin7 7.1.0-1+cuda10.2 arm64 TensorRT plugin libraries
ii libnvinfer-samples 7.1.0-1+cuda10.2 all TensorRT samples
ii libnvinfer7 7.1.0-1+cuda10.2 arm64 TensorRT runtime libraries
ii libnvonnxparsers-dev 7.1.0-1+cuda10.2 arm64 TensorRT ONNX libraries
ii libnvonnxparsers7 7.1.0-1+cuda10.2 arm64 TensorRT ONNX libraries
ii libnvparsers-dev 7.1.0-1+cuda10.2 arm64 TensorRT parsers libraries
ii libnvparsers7 7.1.0-1+cuda10.2 arm64 TensorRT parsers libraries
ii nvidia-container-csv-tensorrt 7.1.0.16-1+cuda10.2 arm64 Jetpack TensorRT CSV file
ii python-libnvinfer 7.1.0-1+cuda10.2 arm64 Python bindings for TensorRT
ii python-libnvinfer-dev 7.1.0-1+cuda10.2 arm64 Python development package for TensorRT
ii python3-libnvinfer 7.1.0-1+cuda10.2 arm64 Python 3 bindings for TensorRT
ii python3-libnvinfer-dev 7.1.0-1+cuda10.2 arm64 Python 3 development package for TensorRT
ii tensorrt 7.1.0.16-1+cuda10.2 arm64 Meta package of TensorRT
ii uff-converter-tf 7.1.0-1+cuda10.2 arm64 UFF converter for TensorRT package
GPU Type : 512-Core NVIDIA Volta @ 1377MHz
Nvidia Driver Version : ?
CUDA Version : V10.2.89
nvidia@xavier:~/tf_to_trt_image_classification$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_21:14:42_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
CUDNN Version : 8.0.0
nvidia@xavier:~/tf_to_trt_image_classification$ dpkg --list | grep libcudnn
ii libcudnn8 8.0.0.145-1+cuda10.2 arm64 cuDNN runtime libraries
ii libcudnn8-dev 8.0.0.145-1+cuda10.2 arm64 cuDNN development libraries and headers
ii libcudnn8-doc 8.0.0.145-1+cuda10.2 arm64 cuDNN documents and samples
Operating System + Version : Ubuntu 18.04
Python Version (if applicable) : 3.6.9
TensorFlow Version (if applicable) : 1.15.2