Hi,
Sorry for keeping you waiting.
Here are the remaining experiment results for your reference:
1. VGG19 (vgg19_N2.prototxt) on DLA with --iterations=551.
Since trtexec measures by time, the actual number of iterations is 559.
Layer placement
[06/29/2021-14:49:32] [I] [TRT] --------------- Layers running on DLA:
[06/29/2021-14:49:32] [I] [TRT] {conv1_1,relu1_1,conv1_2,relu1_2,pool1,conv2_1,relu2_1,conv2_2,relu2_2,pool2,conv3_1,relu3_1,conv3_2,relu3_2,conv3_3,relu3_3,conv3_4,relu3_4,pool3,conv4_1,relu4_1,conv4_2,relu4_2,conv4_3,relu4_3,conv4_4,relu4_4,pool4,conv5_1,relu5_1,conv5_2,relu5_2,conv5_3,relu5_3,conv5_4,relu5_4,pool5}, {relu6,fc7,relu7,fc8},
[06/29/2021-14:49:32] [I] [TRT] --------------- Layers running on GPU:
[06/29/2021-14:49:32] [I] [TRT] fc6, prob,
Profiling report
[06/29/2021-14:55:53] [I] === Profile (559 iterations ) ===
[06/29/2021-14:55:53] [I] Layer Time (ms) Avg. Time (ms) Time %
[06/29/2021-14:55:53] [I] data to nvm 44.14 0.08 0.3
[06/29/2021-14:55:53] [I] {conv1_1,relu1_1,conv1_2,relu1_2,pool1,conv2_1,relu2_1,conv2_2,relu2_2,pool2,conv3_1,relu3_1,conv3_2,relu3_2,conv3_3,relu3_3,conv3_4,relu3_4,pool3,conv4_1,relu4_1,conv4_2,relu4_2,conv4_3,relu4_3,conv4_4,relu4_4,pool4,conv5_1,relu5_1,conv5_2,relu5_2,conv5_3,relu5_3,conv5_4,relu5_4,pool5} 153.77 0.28 1.2
[06/29/2021-14:55:53] [I] data copy finish 23.59 0.04 0.2
[06/29/2021-14:55:53] [I] {conv1_1,relu1_1,conv1_2,relu1_2,pool1,conv2_1,relu2_1,conv2_2,relu2_2,pool2,conv3_1,relu3_1,conv3_2,relu3_2,conv3_3,relu3_3,conv3_4,relu3_4,pool3,conv4_1,relu4_1,conv4_2,relu4_2,conv4_3,relu4_3,conv4_4,relu4_4,pool4,conv5_1,relu5_1,conv5_2,relu5_2,conv5_3,relu5_3,conv5_4,relu5_4,pool5} output reformatter 0 10880.31 19.46 82.3
[06/29/2021-14:55:53] [I] {conv1_1,relu1_1,conv1_2,relu1_2,pool1,conv2_1,relu2_1,conv2_2,relu2_2,pool2,conv3_1,relu3_1,conv3_2,relu3_2,conv3_3,relu3_3,conv3_4,relu3_4,pool3,conv4_1,relu4_1,conv4_2,relu4_2,conv4_3,relu4_3,conv4_4,relu4_4,pool4,conv5_1,relu5_1,conv5_2,relu5_2,conv5_3,relu5_3,conv5_4,relu5_4,pool5} output to be reformatted 0 finish 2.09 0.00 0.0
[06/29/2021-14:55:53] [I] fc6 1176.52 2.10 8.9
[06/29/2021-14:55:53] [I] fc6 output reformatter 0 2.89 0.01 0.0
[06/29/2021-14:55:53] [I] {relu6,fc7,relu7,fc8} 1.04 0.00 0.0
[06/29/2021-14:55:53] [I] (Unnamed Layer* 37) [Fully Connected]_output finish 0.28 0.00 0.0
[06/29/2021-14:55:53] [I] prob input reformatter 0 924.98 1.65 7.0
[06/29/2021-14:55:53] [I] fc8 finish 1.73 0.00 0.0
[06/29/2021-14:55:53] [I] prob 4.14 0.01 0.0
[06/29/2021-14:55:53] [I] prob output reformatter 0 3.08 0.01 0.0
[06/29/2021-14:55:53] [I] Total 13218.55 23.65 100.0
[06/29/2021-14:55:53] [I]
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --deploy=vgg19_N2.prototxt --output=prob --workspace=4096 --best --dumpProfile --useDLACore=0 --allowGPUFallback --iterations=551
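To rank the rows of a profile like the one above, a small script can help. This is only a rough sketch: it assumes each row ends with the three numeric columns shown in the log (total ms, average ms, percent), and `parse_profile` and `top_layers` are hypothetical helper names, not part of trtexec. The long DLA subgraph name is abbreviated in the sample input.

```python
import re

# A profile row ends with three decimal columns: total ms, avg ms, percent.
ROW = re.compile(
    r"\[I\]\s+(?P<name>.+?)\s+(?P<total>\d+\.\d+)\s+(?P<avg>\d+\.\d+)\s+(?P<pct>\d+\.\d+)\s*$"
)

def parse_profile(lines):
    """Return (name, total_ms, avg_ms, percent) for each layer row, skipping 'Total'."""
    rows = []
    for line in lines:
        m = ROW.search(line)
        if m and m.group("name") != "Total":
            rows.append((m.group("name"), float(m.group("total")),
                         float(m.group("avg")), float(m.group("pct"))))
    return rows

def top_layers(rows, n=3):
    """Return the n rows with the largest total time."""
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]

# Rows copied from the log above (DLA subgraph name abbreviated for readability).
sample = [
    "[06/29/2021-14:55:53] [I] Layer Time (ms) Avg. Time (ms) Time %",
    "[06/29/2021-14:55:53] [I] data to nvm 44.14 0.08 0.3",
    "[06/29/2021-14:55:53] [I] {conv1_1,...,pool5} output reformatter 0 10880.31 19.46 82.3",
    "[06/29/2021-14:55:53] [I] fc6 1176.52 2.10 8.9",
    "[06/29/2021-14:55:53] [I] prob input reformatter 0 924.98 1.65 7.0",
    "[06/29/2021-14:55:53] [I] Total 13218.55 23.65 100.0",
]

rows = parse_profile(sample)
for name, total_ms, avg_ms, pct in top_layers(rows):
    print(f"{pct:5.1f}%  {avg_ms:6.2f} ms/iter  {name}")
```

On this data the output reformatter after the DLA subgraph comes first, followed by fc6 and the prob input reformatter.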
2. Profiling fc8 only.
Layer architecture
We reduced the model to the architecture below to capture the inference time of fc8 alone:
name: "VGG_ILSVRC_19_layers"
input: "data"
input_dim: 2
input_dim: 4096
input_dim: 1
input_dim: 1
layer {
  bottom: "data"
  top: "fc8"
  name: "fc8"
  type: "InnerProduct"
  inner_product_param {
    num_output: 1000
  }
}
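For a sense of scale, the size of this standalone layer can be estimated from the prototxt dimensions. This is a back-of-the-envelope sketch: `fc_cost` is a hypothetical helper, and it assumes the layer keeps Caffe's default bias term, since the prototxt does not set `bias_term`.

```python
def fc_cost(batch, in_features, out_features, bias=True):
    """Return (parameter count, multiply-accumulates per inference) for a fully connected layer."""
    params = in_features * out_features + (out_features if bias else 0)
    macs = batch * in_features * out_features
    return params, macs

# batch=2 and in_features=4096 come from input_dim, out_features=1000 from num_output.
params, macs = fc_cost(batch=2, in_features=4096, out_features=1000)
print(f"parameters: {params:,}")
print(f"MACs per inference: {macs:,}")
```

The layer is small, so fixed per-inference costs such as copies and reformatting can be a large share of the measured time.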
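Because each inference is very short, the run is bounded by time rather than by the --iterations value, so the actual iteration count is much larger than 551 (13744 on GPU).
We ran this on both GPU and DLA for comparison: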
Inference time with GPU
[06/29/2021-15:01:58] [I] Host Latency
[06/29/2021-15:01:58] [I] min: 0.17334 ms (end to end 0.182617 ms)
[06/29/2021-15:01:58] [I] max: 0.357941 ms (end to end 0.367828 ms)
[06/29/2021-15:01:58] [I] mean: 0.211059 ms (end to end 0.221694 ms)
[06/29/2021-15:01:58] [I] median: 0.208618 ms (end to end 0.219238 ms)
[06/29/2021-15:01:58] [I] percentile: 0.255096 ms at 99% (end to end 0.267029 ms at 99%)
[06/29/2021-15:01:58] [I] throughput: 4303.13 qps
[06/29/2021-15:01:58] [I] walltime: 3.00038 s
[06/29/2021-15:01:58] [I] Enqueue Time
[06/29/2021-15:01:58] [I] min: 0.148926 ms
[06/29/2021-15:01:58] [I] max: 0.330475 ms
[06/29/2021-15:01:58] [I] median: 0.178711 ms
[06/29/2021-15:01:58] [I] GPU Compute
[06/29/2021-15:01:58] [I] min: 0.151611 ms
[06/29/2021-15:01:58] [I] max: 0.33432 ms
[06/29/2021-15:01:58] [I] mean: 0.185042 ms
[06/29/2021-15:01:58] [I] median: 0.182251 ms
[06/29/2021-15:01:58] [I] percentile: 0.222656 ms at 99%
[06/29/2021-15:01:58] [I] total compute time: 2.38908 s
[06/29/2021-15:01:58] [I]
[06/29/2021-15:01:58] [I] === Profile (13744 iterations ) ===
[06/29/2021-15:01:58] [I] Layer Time (ms) Avg. Time (ms) Time %
[06/29/2021-15:01:58] [I] fc8 input reformatter 0 272.52 0.02 14.0
[06/29/2021-15:01:58] [I] fc8 1612.53 0.12 83.1
[06/29/2021-15:01:58] [I] fc8 output reformatter 0 55.67 0.00 2.9
[06/29/2021-15:01:58] [I] Total 1940.72 0.14 100.0
[06/29/2021-15:01:58] [I]
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --deploy=fc8.prototxt --output=fc8 --workspace=4096 --best --dumpProfile --iterations=551
Inference time with DLA
[06/29/2021-15:10:04] [I] Host Latency
[06/29/2021-15:10:04] [I] min: 0.476074 ms (end to end 0.487793 ms)
[06/29/2021-15:10:04] [I] max: 1.72412 ms (end to end 1.74268 ms)
[06/29/2021-15:10:04] [I] mean: 0.628462 ms (end to end 0.640405 ms)
[06/29/2021-15:10:04] [I] median: 0.618408 ms (end to end 0.629883 ms)
[06/29/2021-15:10:04] [I] percentile: 0.797363 ms at 99% (end to end 0.812988 ms at 99%)
[06/29/2021-15:10:04] [I] throughput: 1499.15 qps
[06/29/2021-15:10:04] [I] walltime: 9.16789 s
[06/29/2021-15:10:04] [I] Enqueue Time
[06/29/2021-15:10:04] [I] min: 0.512695 ms
[06/29/2021-15:10:04] [I] max: 1.66455 ms
[06/29/2021-15:10:04] [I] median: 0.584229 ms
[06/29/2021-15:10:04] [I] GPU Compute
[06/29/2021-15:10:04] [I] min: 0.42041 ms
[06/29/2021-15:10:04] [I] max: 1.67871 ms
[06/29/2021-15:10:04] [I] mean: 0.597904 ms
[06/29/2021-15:10:04] [I] median: 0.588623 ms
[06/29/2021-15:10:04] [I] percentile: 0.753418 ms at 99%
[06/29/2021-15:10:04] [I] total compute time: 8.2176 s
[06/29/2021-15:10:04] [I]
[06/29/2021-15:10:04] [I] === Profile (14023 iterations ) ===
[06/29/2021-15:10:04] [I] Layer Time (ms) Avg. Time (ms) Time %
[06/29/2021-15:10:04] [I] data to nvm 382.51 0.03 5.4
[06/29/2021-15:10:04] [I] {fc8} 2463.73 0.18 34.5
[06/29/2021-15:10:04] [I] data copy finish 272.82 0.02 3.8
[06/29/2021-15:10:04] [I] fc8 from nvm 3975.76 0.28 55.7
[06/29/2021-15:10:04] [I] fc8 copy finish 44.96 0.00 0.6
[06/29/2021-15:10:04] [I] Total 7139.79 0.51 100.0
[06/29/2021-15:10:04] [I]
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --deploy=fc8.prototxt --output=fc8 --workspace=4096 --best --dumpProfile --useDLACore=0 --allowGPUFallback --iterations=13744
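The two fc8 runs can be compared with a few lines of arithmetic. This sketch uses only numbers copied from the logs above. Treating every DLA profile row other than `{fc8}` as data movement overhead is my assumption, not something trtexec reports.

```python
# Values copied from the GPU and DLA logs above.
gpu = {"mean_compute_ms": 0.185042, "throughput_qps": 4303.13}
dla = {"mean_compute_ms": 0.597904, "throughput_qps": 1499.15}

# DLA profile rows: total time in ms over 14023 iterations.
dla_profile = {
    "data to nvm": 382.51,
    "{fc8}": 2463.73,
    "data copy finish": 272.82,
    "fc8 from nvm": 3975.76,
    "fc8 copy finish": 44.96,
}

latency_ratio = dla["mean_compute_ms"] / gpu["mean_compute_ms"]
throughput_ratio = gpu["throughput_qps"] / dla["throughput_qps"]

total_ms = sum(dla_profile.values())
# Assumption: everything except the {fc8} row is copy/transfer overhead.
overhead_share = (total_ms - dla_profile["{fc8}"]) / total_ms

print(f"DLA mean compute time is {latency_ratio:.2f}x the GPU's")
print(f"GPU throughput is {throughput_ratio:.2f}x the DLA's")
print(f"Non-fc8 rows account for {overhead_share:.1%} of the DLA profile")
```

Under that assumption, most of the DLA time for this single layer is spent outside the fc8 computation itself.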
Thanks.