Performance data (latency) for VGG16 layer-by-layer inference

Hello,

I am looking for published performance data (latency in milliseconds) for inference on the Jetson AGX Xavier (with the DLA, Deep Learning Accelerator) running a VGG16 CNN.
Specifically, I need layer-by-layer latency when executing inference with the VGG16 model on the ImageNet dataset (or a similar dataset).

I am looking for latency data (from the start of a layer's processing to the end of that same layer's processing) listed for each layer, for example:

CONV1 layer - x1 ms
CONV2 layer - x2 ms

Fully-connected FC8 layer - y_fc8 ms
Fully-connected FC7 layer - y_fc7 ms
Fully-connected FC6 layer - y_fc6 ms

These are the layers I’m interested in. I have a VLSI hardware background and am familiar with (multi-cycle) hardware pipeline stages that expose start/done processing flags per stage; these flags allow easy and accurate per-stage latency measurements in hardware. Intuitively, similar start/done flags for each DNN layer could be used to profile per-layer inference latency. Perhaps the Xavier AGX DLA has such start/done flags, and software applications have used them to extract layer-by-layer inference latency?

I’m aware of these benchmarks:

Edge TPU performance benchmarks | Coral

for a VGG16 model, but they list the inference latency for the entire VGG16 model and don’t provide a layer-by-layer breakdown of the processing latency.

thank you,
Nick Iliev, Ph.D.
Research Associate
ECE AEON lab
UIC

Hi,

Sorry, we don’t have an official layer-level performance table.
But we tested it in our environment for your reference:

[05/12/2021-11:20:50] [I] === Reporting Options ===
[05/12/2021-11:20:50] [I] Verbose: Disabled
[05/12/2021-11:20:50] [I] Averages: 10 inferences
[05/12/2021-11:20:50] [I] Percentile: 99
[05/12/2021-11:20:50] [I] Dump output: Disabled
[05/12/2021-11:20:50] [I] Profile: Enabled
[05/12/2021-11:20:50] [I] Export timing to JSON file: 
[05/12/2021-11:20:50] [I] Export output to JSON file: 
[05/12/2021-11:20:50] [I] Export profile to JSON file: 
[05/12/2021-11:20:50] [I] 
[05/12/2021-11:20:53] [I] Starting inference threads
[05/12/2021-11:20:56] [I] Warmup completed 35 queries over 200 ms
[05/12/2021-11:20:56] [I] Timing trace has 516 queries over 3.00614 s
[05/12/2021-11:20:56] [I] Trace averages of 10 runs:
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.7434 ms - Host latency: 5.78721 ms (end to end 5.79881 ms, enqueue 5.75565 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.75654 ms - Host latency: 5.80091 ms (end to end 5.81107 ms, enqueue 5.7685 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.76519 ms - Host latency: 5.80823 ms (end to end 5.81805 ms, enqueue 5.77719 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.80911 ms - Host latency: 5.85218 ms (end to end 5.86201 ms, enqueue 5.82114 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.80177 ms - Host latency: 5.84561 ms (end to end 5.85667 ms, enqueue 5.81425 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.80778 ms - Host latency: 5.8507 ms (end to end 5.8633 ms, enqueue 5.82024 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.79459 ms - Host latency: 5.83488 ms (end to end 5.84675 ms, enqueue 5.80673 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.76894 ms - Host latency: 5.81678 ms (end to end 5.82734 ms, enqueue 5.77802 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.79681 ms - Host latency: 5.84048 ms (end to end 5.85104 ms, enqueue 5.80901 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.79918 ms - Host latency: 5.84699 ms (end to end 5.85881 ms, enqueue 5.81207 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.75254 ms - Host latency: 5.79558 ms (end to end 5.80689 ms, enqueue 5.76423 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.76989 ms - Host latency: 5.81754 ms (end to end 5.8283 ms, enqueue 5.7817 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.77558 ms - Host latency: 5.8185 ms (end to end 5.82889 ms, enqueue 5.78716 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.77496 ms - Host latency: 5.82249 ms (end to end 5.83344 ms, enqueue 5.78696 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.77123 ms - Host latency: 5.81819 ms (end to end 5.82732 ms, enqueue 5.78214 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.7688 ms - Host latency: 5.81387 ms (end to end 5.82338 ms, enqueue 5.78055 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.77784 ms - Host latency: 5.82618 ms (end to end 5.83608 ms, enqueue 5.7895 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.78965 ms - Host latency: 5.83303 ms (end to end 5.84543 ms, enqueue 5.80131 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.81152 ms - Host latency: 5.85663 ms (end to end 5.86583 ms, enqueue 5.82389 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.80918 ms - Host latency: 5.85201 ms (end to end 5.8639 ms, enqueue 5.82128 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.8142 ms - Host latency: 5.85457 ms (end to end 5.86383 ms, enqueue 5.82347 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.801 ms - Host latency: 5.84247 ms (end to end 5.85052 ms, enqueue 5.81188 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.79081 ms - Host latency: 5.82856 ms (end to end 5.839 ms, enqueue 5.80211 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.80482 ms - Host latency: 5.84207 ms (end to end 5.85051 ms, enqueue 5.81573 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.77529 ms - Host latency: 5.81272 ms (end to end 5.8217 ms, enqueue 5.78608 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.77181 ms - Host latency: 5.81573 ms (end to end 5.82448 ms, enqueue 5.78318 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.73824 ms - Host latency: 5.77811 ms (end to end 5.78761 ms, enqueue 5.75024 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.74836 ms - Host latency: 5.78588 ms (end to end 5.79558 ms, enqueue 5.76011 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.7318 ms - Host latency: 5.77083 ms (end to end 5.78077 ms, enqueue 5.74376 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.71219 ms - Host latency: 5.74891 ms (end to end 5.75865 ms, enqueue 5.72341 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.71383 ms - Host latency: 5.75094 ms (end to end 5.76313 ms, enqueue 5.72471 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.70685 ms - Host latency: 5.74347 ms (end to end 5.7527 ms, enqueue 5.71509 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.70798 ms - Host latency: 5.74497 ms (end to end 5.75405 ms, enqueue 5.7196 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.7188 ms - Host latency: 5.75696 ms (end to end 5.7658 ms, enqueue 5.73062 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.71694 ms - Host latency: 5.75457 ms (end to end 5.76448 ms, enqueue 5.72749 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.74805 ms - Host latency: 5.7853 ms (end to end 5.7939 ms, enqueue 5.75872 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.7511 ms - Host latency: 5.78784 ms (end to end 5.79885 ms, enqueue 5.76294 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.73958 ms - Host latency: 5.77756 ms (end to end 5.78687 ms, enqueue 5.75159 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.74224 ms - Host latency: 5.77869 ms (end to end 5.78916 ms, enqueue 5.75388 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.7207 ms - Host latency: 5.75894 ms (end to end 5.76848 ms, enqueue 5.73252 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.76091 ms - Host latency: 5.79902 ms (end to end 5.80967 ms, enqueue 5.77344 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.76799 ms - Host latency: 5.80657 ms (end to end 5.8166 ms, enqueue 5.77991 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.76272 ms - Host latency: 5.80269 ms (end to end 5.81199 ms, enqueue 5.77437 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.78333 ms - Host latency: 5.82004 ms (end to end 5.82986 ms, enqueue 5.79497 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.76565 ms - Host latency: 5.803 ms (end to end 5.81235 ms, enqueue 5.77744 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.77095 ms - Host latency: 5.81038 ms (end to end 5.81963 ms, enqueue 5.78267 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.77161 ms - Host latency: 5.80715 ms (end to end 5.81814 ms, enqueue 5.78337 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.77673 ms - Host latency: 5.81448 ms (end to end 5.82356 ms, enqueue 5.7874 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.76128 ms - Host latency: 5.79878 ms (end to end 5.80835 ms, enqueue 5.77324 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.76304 ms - Host latency: 5.79861 ms (end to end 5.80881 ms, enqueue 5.77502 ms)
[05/12/2021-11:20:56] [I] Average on 10 runs - GPU latency: 5.76699 ms - Host latency: 5.80432 ms (end to end 5.81375 ms, enqueue 5.77759 ms)
[05/12/2021-11:20:56] [I] Host Latency
[05/12/2021-11:20:56] [I] min: 5.71167 ms (end to end 5.72095 ms)
[05/12/2021-11:20:56] [I] max: 5.92725 ms (end to end 5.94067 ms)
[05/12/2021-11:20:56] [I] mean: 5.8062 ms (end to end 5.81627 ms)
[05/12/2021-11:20:56] [I] median: 5.80753 ms (end to end 5.81715 ms)
[05/12/2021-11:20:56] [I] percentile: 5.89606 ms at 99% (end to end 5.90784 ms at 99%)
[05/12/2021-11:20:56] [I] throughput: 171.649 qps
[05/12/2021-11:20:56] [I] walltime: 3.00614 s
[05/12/2021-11:20:56] [I] Enqueue Time
[05/12/2021-11:20:56] [I] min: 5.68384 ms
[05/12/2021-11:20:56] [I] max: 5.90491 ms
[05/12/2021-11:20:56] [I] median: 5.77722 ms
[05/12/2021-11:20:56] [I] GPU Compute
[05/12/2021-11:20:56] [I] min: 5.67151 ms
[05/12/2021-11:20:56] [I] max: 5.88928 ms
[05/12/2021-11:20:56] [I] mean: 5.76558 ms
[05/12/2021-11:20:56] [I] median: 5.76587 ms
[05/12/2021-11:20:56] [I] percentile: 5.84546 ms at 99%
[05/12/2021-11:20:56] [I] total compute time: 2.97504 s
[05/12/2021-11:20:56] [I] 
[05/12/2021-11:20:56] [I] === Profile (551 iterations ) ===
[05/12/2021-11:20:56] [I]                                  Layer   Time (ms)   Avg. Time (ms)   Time %
[05/12/2021-11:20:56] [I]  conv1_1 + relu1_1 input reformatter 0       10.92             0.02      0.3
[05/12/2021-11:20:56] [I]                      conv1_1 + relu1_1       72.99             0.13      2.3
[05/12/2021-11:20:56] [I]                      conv1_2 + relu1_2      146.31             0.27      4.6
[05/12/2021-11:20:56] [I]                                  pool1       33.36             0.06      1.1
[05/12/2021-11:20:56] [I]                      conv2_1 + relu2_1       73.56             0.13      2.3
[05/12/2021-11:20:56] [I]                      conv2_2 + relu2_2      127.90             0.23      4.0
[05/12/2021-11:20:56] [I]                                  pool2       19.17             0.03      0.6
[05/12/2021-11:20:56] [I]                      conv3_1 + relu3_1       71.49             0.13      2.3
[05/12/2021-11:20:56] [I]                      conv3_2 + relu3_2      130.14             0.24      4.1
[05/12/2021-11:20:56] [I]                      conv3_3 + relu3_3      130.53             0.24      4.1
[05/12/2021-11:20:56] [I]                      conv3_4 + relu3_4      130.51             0.24      4.1
[05/12/2021-11:20:56] [I]                                  pool3       11.56             0.02      0.4
[05/12/2021-11:20:56] [I]                      conv4_1 + relu4_1       72.02             0.13      2.3
[05/12/2021-11:20:56] [I]                      conv4_2 + relu4_2      133.73             0.24      4.2
[05/12/2021-11:20:56] [I]                      conv4_3 + relu4_3      134.50             0.24      4.3
[05/12/2021-11:20:56] [I]                      conv4_4 + relu4_4      133.70             0.24      4.2
[05/12/2021-11:20:56] [I]                                  pool4        7.04             0.01      0.2
[05/12/2021-11:20:56] [I]                      conv5_1 + relu5_1       55.98             0.10      1.8
[05/12/2021-11:20:56] [I]                      conv5_2 + relu5_2       55.64             0.10      1.8
[05/12/2021-11:20:56] [I]                      conv5_3 + relu5_3       54.66             0.10      1.7
[05/12/2021-11:20:56] [I]                      conv5_4 + relu5_4       56.43             0.10      1.8
[05/12/2021-11:20:56] [I]                                  pool5        3.35             0.01      0.1
[05/12/2021-11:20:56] [I]        fc6 + relu6 input reformatter 0        4.91             0.01      0.2
[05/12/2021-11:20:56] [I]                            fc6 + relu6     1209.04             2.19     38.2
[05/12/2021-11:20:56] [I]                            fc7 + relu7      207.44             0.38      6.6
[05/12/2021-11:20:56] [I]                                    fc8       68.68             0.12      2.2
[05/12/2021-11:20:56] [I]                                   prob        3.52             0.01      0.1
[05/12/2021-11:20:56] [I]              prob output reformatter 0        2.84             0.01      0.1
[05/12/2021-11:20:56] [I]                                  Total     3161.93             5.74    100.0
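
This table is from trtexec with the --dumpProfile option. If you want to collect the same per-layer timing from your own application (similar in spirit to the start/done flags you mentioned), TensorRT provides an IProfiler callback that reports each layer’s execution time after every inference. Below is a minimal Python sketch for reference; it assumes an already-built engine with an explicit batch dimension and static shapes, and the engine file name is only an example:

import numpy as np
import pycuda.autoinit          # creates a CUDA context (imported for its side effect)
import pycuda.driver as cuda
import tensorrt as trt


class LayerTimer(trt.IProfiler):
    """Accumulates the per-layer execution time reported by TensorRT."""

    def __init__(self):
        trt.IProfiler.__init__(self)
        self.ms_per_layer = {}

    def report_layer_time(self, layer_name, ms):
        # TensorRT calls this once per layer per inference with that layer's time.
        self.ms_per_layer[layer_name] = self.ms_per_layer.get(layer_name, 0.0) + ms


N_RUNS = 100
logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)
with open("vgg16.engine", "rb") as f:           # example engine path
    engine = runtime.deserialize_cuda_engine(f.read())

# Allocate one device buffer per binding (static shapes assumed).
bindings = []
for i in range(engine.num_bindings):
    nbytes = trt.volume(engine.get_binding_shape(i)) * \
        np.dtype(trt.nptype(engine.get_binding_dtype(i))).itemsize
    bindings.append(int(cuda.mem_alloc(nbytes)))

timer = LayerTimer()
context = engine.create_execution_context()
context.profiler = timer                        # enables per-layer timing

for _ in range(N_RUNS):
    context.execute_v2(bindings)                # synchronous execution

for name, total_ms in timer.ms_per_layer.items():
    print(f"{name}: {total_ms / N_RUNS:.3f} ms per inference")

Note that these numbers are layer execution times as observed by TensorRT, not values read from hardware start/done flags.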

Thanks.

This is great data, thank you! I can see the total time (in ms) over 551 iterations in the Time column and the corresponding per-iteration average in the Avg. Time column. To clarify, this is with the DLA (Deep Learning Accelerator) enabled for all VGG16 processing, correct?

Hi,

Sorry, that result was measured on the GPU.
We are going to get a DLA result and share it with you later.

Thanks.

Hi,

First, since not all of the layers are supported by the DLA, some fall back to the GPU implementation.

[05/25/2021-16:05:23] [I] [TRT] --------------- Layers running on DLA: 
[05/25/2021-16:05:23] [I] [TRT] {conv1_1,relu1_1,conv1_2,relu1_2,pool1,conv2_1,relu2_1,conv2_2,relu2_2,pool2,conv3_1,relu3_1,conv3_2,relu3_2,conv3_3,relu3_3,conv3_4,relu3_4,pool3,conv4_1,relu4_1,conv4_2,relu4_2,conv4_3,relu4_3,conv4_4,relu4_4,pool4,conv5_1,relu5_1,conv5_2,relu5_2,conv5_3,relu5_3,conv5_4,relu5_4,pool5}, {relu6,fc7,relu7,fc8}, 
[05/25/2021-16:05:23] [I] [TRT] --------------- Layers running on GPU: 
[05/25/2021-16:05:23] [I] [TRT] fc6, prob, 
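
For reference, this placement corresponds to building the engine with the DLA as the default device and GPU fallback enabled (the same intent as the --useDLACore=0 --allowGPUFallback flags in the trtexec command further below). A rough Python sketch of the equivalent builder configuration; the tiny stand-in network is only there to make the example self-contained, and in practice the network comes from the Caffe/ONNX parser:

import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Stand-in single-convolution network so the sketch builds on its own.
data = network.add_input("data", trt.float32, (1, 3, 224, 224))
w = trt.Weights(np.zeros((64, 3, 3, 3), dtype=np.float32))
conv = network.add_convolution_nd(data, 64, (3, 3), w, trt.Weights())
network.mark_output(conv.get_output(0))

config = builder.create_builder_config()
config.default_device_type = trt.DeviceType.DLA   # run layers on the DLA where supported
config.DLA_core = 0                                # same intent as --useDLACore=0
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)      # same intent as --allowGPUFallback
config.set_flag(trt.BuilderFlag.FP16)              # DLA requires FP16 or INT8 precision
config.max_workspace_size = 4096 << 20             # same intent as --workspace=4096

engine = builder.build_engine(network, config)     # unsupported layers fall back to the GPU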

Below is the layer-level performance:

[05/25/2021-16:11:33] [I] === Profile (135 iterations ) ===
[05/25/2021-16:11:33] [I]                                                                                                                                                                                                                                                                                            Layer   Time (ms)   Avg. Time (ms)   Time %
[05/25/2021-16:11:33] [I]                                                                                                                                                                                                                                                                                      data to nvm       13.58             0.10      0.4
[05/25/2021-16:11:33] [I]  {conv1_1,relu1_1,conv1_2,relu1_2,pool1,conv2_1,relu2_1,conv2_2,relu2_2,pool2,conv3_1,relu3_1,conv3_2,relu3_2,conv3_3,relu3_3,conv3_4,relu3_4,pool3,conv4_1,relu4_1,conv4_2,relu4_2,conv4_3,relu4_3,conv4_4,relu4_4,pool4,conv5_1,relu5_1,conv5_2,relu5_2,conv5_3,relu5_3,conv5_4,relu5_4,pool5}       42.47             0.31      1.3
[05/25/2021-16:11:33] [I]                                                                                                                                                                                                                                                                                 data copy finish        3.51             0.03      0.1
[05/25/2021-16:11:33] [I]                                                                                                                                                                                                                                                                          fc6 input reformatter 0     2639.01            19.55     81.8
[05/25/2021-16:11:33] [I]                                                                                                                                                                                                                                                                                     pool5 finish        0.51             0.00      0.0
[05/25/2021-16:11:33] [I]                                                                                                                                                                                                                                                                                              fc6      297.70             2.21      9.2
[05/25/2021-16:11:33] [I]                                                                                                                                                                                                                                                        {relu6,fc7,relu7,fc8} input reformatter 0        0.70             0.01      0.0
[05/25/2021-16:11:33] [I]                                                                                                                                                                                                                                                                            {relu6,fc7,relu7,fc8}        0.27             0.00      0.0
[05/25/2021-16:11:33] [I]                                                                                                                                                                                                                                                 {relu6,fc7,relu7,fc8} reformatted input 0 finish        0.06             0.00      0.0
[05/25/2021-16:11:33] [I]                                                                                                                                                                                                                                                       {relu6,fc7,relu7,fc8} output reformatter 0      224.59             1.66      7.0
[05/25/2021-16:11:33] [I]                                                                                                                                                                                                                                          {relu6,fc7,relu7,fc8} output to be reformatted 0 finish        0.42             0.00      0.0
[05/25/2021-16:11:33] [I]                                                                                                                                                                                                                                                                                             prob        1.05             0.01      0.0
[05/25/2021-16:11:33] [I]                                                                                                                                                                                                                                                                        prob output reformatter 0        0.69             0.01      0.0
[05/25/2021-16:11:33] [I]                                                                                                                                                                                                                                                                                            Total     3224.55            23.89    100.0
[05/25/2021-16:11:33] [I] 
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --deploy=vgg19_N2.prototxt --output=prob --workspace=4096 --best --dumpProfile --useDLACore=0 --allowGPUFallback

Not sure if this meets your requirement.

But please note that the DLA is a hardware component.
We can only measure the time between the launch and the return of a DLA task.

Thanks.

Hi,

Thanks for the great DLA-based layer-by-layer profiling. I have two comments:

  1. The total time (3224.55 ms for 135 iterations) seems higher than the all-GPU result, which used 551 iterations. What is the total time for the DLA-based solution over 551 iterations? Intuitively the GPU+DLA solution should be faster than the all-GPU solution, but I may be missing something.

  2. Is it possible to capture the DLA’s latency for the fc8 layer alone, for the same 551 iterations?

thank you,
Nick Iliev, Ph.D.
Research Associate
ECE AEON lab
UIC

Hi,

Our benchmark is measured over a fixed time window rather than a fixed iteration count.
For example, with the default window of 3000 ms, the DLA configuration completes 135 iterations while the GPU completes 551 in the same period (each DLA query takes roughly four times as long, so far fewer fit into the window).

The DLA’s purpose is to offload work from the GPU rather than to maximize performance.
Moreover, the DLA has far fewer resources than the GPU, so a task will wait for resources more frequently on the DLA.

https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#troubleshooting

Q: Why does my network run slower when using DLA compared to without DLA?

A: DLA was designed to maximize energy efficiency. Depending on the features supported by DLA and the features supported by the GPU, either implementation may be more performant. Which implementation to use depends on your latency or throughput requirements and your power budget. Since all DLA engines are independent from the GPU and each other, you could also use both implementations at the same time to further increase the throughput of your network.

We need to arrange a device for the experiment and will share the remaining data with you later.

Thanks.

Hi,

Sorry for keeping you waiting.
Here are the remaining experiments for your reference:

1. VGG16 on DLA with 551 iterations.

Since trtexec measures by time, the actual iteration count is 559.

Layer replacement
[06/29/2021-14:49:32] [I] [TRT] --------------- Layers running on DLA:
[06/29/2021-14:49:32] [I] [TRT] {conv1_1,relu1_1,conv1_2,relu1_2,pool1,conv2_1,relu2_1,conv2_2,relu2_2,pool2,conv3_1,relu3_1,conv3_2,relu3_2,conv3_3,relu3_3,conv3_4,relu3_4,pool3,conv4_1,relu4_1,conv4_2,relu4_2,conv4_3,relu4_3,conv4_4,relu4_4,pool4,conv5_1,relu5_1,conv5_2,relu5_2,conv5_3,relu5_3,conv5_4,relu5_4,pool5}, {relu6,fc7,relu7,fc8},
[06/29/2021-14:49:32] [I] [TRT] --------------- Layers running on GPU:
[06/29/2021-14:49:32] [I] [TRT] fc6, prob,
Profiling report
[06/29/2021-14:55:53] [I] === Profile (559 iterations ) ===
[06/29/2021-14:55:53] [I]                                                                                                                                                                                                                                                                                                                              Layer   Time (ms)   Avg. Time (ms)   Time %
[06/29/2021-14:55:53] [I]                                                                                                                                                                                                                                                                                                                        data to nvm       44.14             0.08      0.3
[06/29/2021-14:55:53] [I]                                    {conv1_1,relu1_1,conv1_2,relu1_2,pool1,conv2_1,relu2_1,conv2_2,relu2_2,pool2,conv3_1,relu3_1,conv3_2,relu3_2,conv3_3,relu3_3,conv3_4,relu3_4,pool3,conv4_1,relu4_1,conv4_2,relu4_2,conv4_3,relu4_3,conv4_4,relu4_4,pool4,conv5_1,relu5_1,conv5_2,relu5_2,conv5_3,relu5_3,conv5_4,relu5_4,pool5}      153.77             0.28      1.2
[06/29/2021-14:55:53] [I]                                                                                                                                                                                                                                                                                                                   data copy finish       23.59             0.04      0.2
[06/29/2021-14:55:53] [I]               {conv1_1,relu1_1,conv1_2,relu1_2,pool1,conv2_1,relu2_1,conv2_2,relu2_2,pool2,conv3_1,relu3_1,conv3_2,relu3_2,conv3_3,relu3_3,conv3_4,relu3_4,pool3,conv4_1,relu4_1,conv4_2,relu4_2,conv4_3,relu4_3,conv4_4,relu4_4,pool4,conv5_1,relu5_1,conv5_2,relu5_2,conv5_3,relu5_3,conv5_4,relu5_4,pool5} output reformatter 0    10880.31            19.46     82.3
[06/29/2021-14:55:53] [I]  {conv1_1,relu1_1,conv1_2,relu1_2,pool1,conv2_1,relu2_1,conv2_2,relu2_2,pool2,conv3_1,relu3_1,conv3_2,relu3_2,conv3_3,relu3_3,conv3_4,relu3_4,pool3,conv4_1,relu4_1,conv4_2,relu4_2,conv4_3,relu4_3,conv4_4,relu4_4,pool4,conv5_1,relu5_1,conv5_2,relu5_2,conv5_3,relu5_3,conv5_4,relu5_4,pool5} output to be reformatted 0 finish        2.09             0.00      0.0
[06/29/2021-14:55:53] [I]                                                                                                                                                                                                                                                                                                                                fc6     1176.52             2.10      8.9
[06/29/2021-14:55:53] [I]                                                                                                                                                                                                                                                                                                           fc6 output reformatter 0        2.89             0.01      0.0
[06/29/2021-14:55:53] [I]                                                                                                                                                                                                                                                                                                              {relu6,fc7,relu7,fc8}        1.04             0.00      0.0
[06/29/2021-14:55:53] [I]                                                                                                                                                                                                                                                                                (Unnamed Layer* 37) [Fully Connected]_output finish        0.28             0.00      0.0
[06/29/2021-14:55:53] [I]                                                                                                                                                                                                                                                                                                           prob input reformatter 0      924.98             1.65      7.0
[06/29/2021-14:55:53] [I]                                                                                                                                                                                                                                                                                                                         fc8 finish        1.73             0.00      0.0
[06/29/2021-14:55:53] [I]                                                                                                                                                                                                                                                                                                                               prob        4.14             0.01      0.0
[06/29/2021-14:55:53] [I]                                                                                                                                                                                                                                                                                                          prob output reformatter 0        3.08             0.01      0.0
[06/29/2021-14:55:53] [I]                                                                                                                                                                                                                                                                                                                              Total    13218.55            23.65    100.0
[06/29/2021-14:55:53] [I]
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --deploy=vgg19_N2.prototxt --output=prob --workspace=4096 --best --dumpProfile --useDLACore=0 --allowGPUFallback --iterations=551

2. Profiling fc8 only.

Layer architecture

We changed the model to the architecture below to capture the inference time of fc8:

name: "VGG_ILSVRC_19_layers"
input: "data"
input_dim: 2
input_dim: 4096
input_dim: 1
input_dim: 1
layer {
  bottom: "data"
  top: "fc8"
  name: "fc8"
  type: "InnerProduct"
  inner_product_param {
    num_output: 1000
  }
}
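
For reference, a rough TensorRT Python-API equivalent of this single-layer model (the weights here are placeholders; the 4096 inputs correspond to fc7’s output width and the 1000 outputs to the ImageNet classes, and the fully-connected layer API shown is the one available in the TensorRT release used in this thread):

import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
builder.max_batch_size = 2                       # matches input_dim: 2 above
network = builder.create_network()               # implicit-batch network, as with the Caffe parser

# One fully-connected layer: 4096 inputs (fc7's width) -> 1000 class scores.
data = network.add_input("data", trt.float32, (4096, 1, 1))
w = trt.Weights(np.zeros((1000, 4096), dtype=np.float32))   # placeholder weights
b = trt.Weights(np.zeros(1000, dtype=np.float32))
fc8 = network.add_fully_connected(data, 1000, w, b)
fc8.name = "fc8"
network.mark_output(fc8.get_output(0))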

Since the inference time of this single layer is too short, trtexec’s time-based measurement runs far more iterations than the requested --iterations value.
We ran it on both the GPU and the DLA for your comparison:

Inference time with GPU
[06/29/2021-15:01:58] [I] Host Latency
[06/29/2021-15:01:58] [I] min: 0.17334 ms (end to end 0.182617 ms)
[06/29/2021-15:01:58] [I] max: 0.357941 ms (end to end 0.367828 ms)
[06/29/2021-15:01:58] [I] mean: 0.211059 ms (end to end 0.221694 ms)
[06/29/2021-15:01:58] [I] median: 0.208618 ms (end to end 0.219238 ms)
[06/29/2021-15:01:58] [I] percentile: 0.255096 ms at 99% (end to end 0.267029 ms at 99%)
[06/29/2021-15:01:58] [I] throughput: 4303.13 qps
[06/29/2021-15:01:58] [I] walltime: 3.00038 s
[06/29/2021-15:01:58] [I] Enqueue Time
[06/29/2021-15:01:58] [I] min: 0.148926 ms
[06/29/2021-15:01:58] [I] max: 0.330475 ms
[06/29/2021-15:01:58] [I] median: 0.178711 ms
[06/29/2021-15:01:58] [I] GPU Compute
[06/29/2021-15:01:58] [I] min: 0.151611 ms
[06/29/2021-15:01:58] [I] max: 0.33432 ms
[06/29/2021-15:01:58] [I] mean: 0.185042 ms
[06/29/2021-15:01:58] [I] median: 0.182251 ms
[06/29/2021-15:01:58] [I] percentile: 0.222656 ms at 99%
[06/29/2021-15:01:58] [I] total compute time: 2.38908 s
[06/29/2021-15:01:58] [I]
[06/29/2021-15:01:58] [I] === Profile (13744 iterations ) ===
[06/29/2021-15:01:58] [I]                     Layer   Time (ms)   Avg. Time (ms)   Time %
[06/29/2021-15:01:58] [I]   fc8 input reformatter 0      272.52             0.02     14.0
[06/29/2021-15:01:58] [I]                       fc8     1612.53             0.12     83.1
[06/29/2021-15:01:58] [I]  fc8 output reformatter 0       55.67             0.00      2.9
[06/29/2021-15:01:58] [I]                     Total     1940.72             0.14    100.0
[06/29/2021-15:01:58] [I]
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --deploy=fc8.prototxt --output=fc8 --workspace=4096 --best --dumpProfile --iterations=551
Inference time with DLA
[06/29/2021-15:10:04] [I] Host Latency
[06/29/2021-15:10:04] [I] min: 0.476074 ms (end to end 0.487793 ms)
[06/29/2021-15:10:04] [I] max: 1.72412 ms (end to end 1.74268 ms)
[06/29/2021-15:10:04] [I] mean: 0.628462 ms (end to end 0.640405 ms)
[06/29/2021-15:10:04] [I] median: 0.618408 ms (end to end 0.629883 ms)
[06/29/2021-15:10:04] [I] percentile: 0.797363 ms at 99% (end to end 0.812988 ms at 99%)
[06/29/2021-15:10:04] [I] throughput: 1499.15 qps
[06/29/2021-15:10:04] [I] walltime: 9.16789 s
[06/29/2021-15:10:04] [I] Enqueue Time
[06/29/2021-15:10:04] [I] min: 0.512695 ms
[06/29/2021-15:10:04] [I] max: 1.66455 ms
[06/29/2021-15:10:04] [I] median: 0.584229 ms
[06/29/2021-15:10:04] [I] GPU Compute
[06/29/2021-15:10:04] [I] min: 0.42041 ms
[06/29/2021-15:10:04] [I] max: 1.67871 ms
[06/29/2021-15:10:04] [I] mean: 0.597904 ms
[06/29/2021-15:10:04] [I] median: 0.588623 ms
[06/29/2021-15:10:04] [I] percentile: 0.753418 ms at 99%
[06/29/2021-15:10:04] [I] total compute time: 8.2176 s
[06/29/2021-15:10:04] [I]
[06/29/2021-15:10:04] [I] === Profile (14023 iterations ) ===
[06/29/2021-15:10:04] [I]             Layer   Time (ms)   Avg. Time (ms)   Time %
[06/29/2021-15:10:04] [I]       data to nvm      382.51             0.03      5.4
[06/29/2021-15:10:04] [I]             {fc8}     2463.73             0.18     34.5
[06/29/2021-15:10:04] [I]  data copy finish      272.82             0.02      3.8
[06/29/2021-15:10:04] [I]      fc8 from nvm     3975.76             0.28     55.7
[06/29/2021-15:10:04] [I]   fc8 copy finish       44.96             0.00      0.6
[06/29/2021-15:10:04] [I]             Total     7139.79             0.51    100.0
[06/29/2021-15:10:04] [I]
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --deploy=fc8.prototxt --output=fc8 --workspace=4096 --best --dumpProfile --useDLACore=0 --allowGPUFallback --iterations=13744

Thanks.

Hi,

That’s the data I was looking for. Thank you very much.

-Nick
