Inference slower after changing DefaultDeviceType from GPU to DLA?

I cloned the repo https://github.com/dusty-nv/jetson-inference for object detection on Jetson Xavier.
The network I used is DetectNet.
The engine runs on the GPU by default, so I wanted to try running inference on the DLA.
I changed two default parameters of DetectNet::create() in detectNet.h:
“maxBatchSize” from 2 to 1, and “device” from DEVICE_GPU to DEVICE_DLA_0.

But inference became slower:
DLA: [TRT] layer network time - 44.175682 ms
GPU: [TRT] layer network time - 10.079423 ms
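
To put a number on the gap, here is a minimal Python sketch that pulls the value out of the two "network time" lines and computes the ratio. The helper name network_time_ms is made up for illustration; it is not part of jetson-inference or TensorRT.

```python
import re

def network_time_ms(line):
    """Extract the millisecond value from a '[TRT] layer network time' line."""
    match = re.search(r"layer network time - ([0-9.]+) ms", line)
    if match is None:
        raise ValueError("no network time found in: " + line)
    return float(match.group(1))

# The two summary lines reported by detectnet-console in the runs described here.
dla_ms = network_time_ms("[TRT] layer network time - 44.175682 ms")
gpu_ms = network_time_ms("[TRT] layer network time - 10.079423 ms")

slowdown = dla_ms / gpu_ms
print(f"DLA run took {slowdown:.2f}x the GPU time")  # prints: DLA run took 4.38x the GPU time
```

Note that the two runs used different images and batch sizes, so the ratio is only a rough indication.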

Below is the detailed output:
GPU:

nvidia@jetson-0423518027970:~/jetson-inference/build/aarch64/bin$ ./detectnet-console dog_0.jpg output_0.jpg coco-dog
detectnet-console
  args (4):  0 [./detectnet-console]  1 [dog_0.jpg]  2 [output_0.jpg]  3 [coco-dog]  


detectNet -- loading detection network model from:
          -- prototxt     networks/DetectNet-COCO-Dog/deploy.prototxt
          -- model        networks/DetectNet-COCO-Dog/snapshot_iter_38600.caffemodel
          -- input_blob   'data'
          -- output_cvg   'coverage'
          -- output_bbox  'bboxes'
          -- mean_pixel   0.000000
          -- class_labels networks/DetectNet-COCO-Dog/class_labels.txt
          -- threshold    0.500000
          -- batch_size   2

[TRT]  TensorRT version 5.0.3
[TRT]  detected model format - caffe  (extension '.caffemodel')
[TRT]  desired precision specified for GPU: FASTEST
[TRT]  requested fasted precision for device GPU without providing valid calibrator, disabling INT8
[TRT]  native precisions detected for GPU:  FP32, FP16, INT8
[TRT]  selecting fastest native precision for GPU:  FP16
[TRT]  attempting to open engine cache file networks/DetectNet-COCO-Dog/snapshot_iter_38600.caffemodel.2.1.GPU.FP16.engine
[TRT]  loading network profile from engine cache... networks/DetectNet-COCO-Dog/snapshot_iter_38600.caffemodel.2.1.GPU.FP16.engine
[TRT]  device GPU, networks/DetectNet-COCO-Dog/snapshot_iter_38600.caffemodel loaded
[TRT]  device GPU, CUDA engine context initialized with 3 bindings
[TRT]  binding -- index   0
               -- name    'data'
               -- type    FP32
               -- in/out  INPUT
               -- # dims  3
               -- dim #0  3 (CHANNEL)
               -- dim #1  640 (SPATIAL)
               -- dim #2  640 (SPATIAL)
[TRT]  binding -- index   1
               -- name    'coverage'
               -- type    FP32
               -- in/out  OUTPUT
               -- # dims  3
               -- dim #0  1 (CHANNEL)
               -- dim #1  40 (SPATIAL)
               -- dim #2  40 (SPATIAL)
[TRT]  binding -- index   2
               -- name    'bboxes'
               -- type    FP32
               -- in/out  OUTPUT
               -- # dims  3
               -- dim #0  4 (CHANNEL)
               -- dim #1  40 (SPATIAL)
               -- dim #2  40 (SPATIAL)
[TRT]  binding to input 0 data  binding index:  0
[TRT]  binding to input 0 data  dims (b=2 c=3 h=640 w=640) size=9830400
[cuda]  cudaAllocMapped 9830400 bytes, CPU 0x21e7f6000 GPU 0x21e7f6000
[TRT]  binding to output 0 coverage  binding index:  1
[TRT]  binding to output 0 coverage  dims (b=2 c=1 h=40 w=40) size=12800
[cuda]  cudaAllocMapped 12800 bytes, CPU 0x21f156000 GPU 0x21f156000
[TRT]  binding to output 1 bboxes  binding index:  2
[TRT]  binding to output 1 bboxes  dims (b=2 c=4 h=40 w=40) size=51200
[cuda]  cudaAllocMapped 51200 bytes, CPU 0x21f356000 GPU 0x21f356000
device GPU, networks/DetectNet-COCO-Dog/snapshot_iter_38600.caffemodel initialized.
[cuda]  cudaAllocMapped 16 bytes, CPU 0x2177b6200 GPU 0x2177b6200
detectNet -- model has 1 object classes
detectNet -- failed to find networks/DetectNet-COCO-Dog/class_labels.txt
detectNet -- maximum bounding boxes:  6400
[cuda]  cudaAllocMapped 102400 bytes, CPU 0x21f556000 GPU 0x21f556000
[cuda]  cudaAllocMapped 25600 bytes, CPU 0x21f362800 GPU 0x21f362800
loaded image  dog_0.jpg  (2049 x 1120)  36718080 bytes
[cuda]  cudaAllocMapped 36718080 bytes, CPU 0x21f756000 GPU 0x21f756000
detectnet-console:  beginning processing network (1553239072906)
[TRT]  layer deploy_transform - 0.175968 ms
[TRT]  layer conv1/7x7_s2 + conv1/relu_7x7 input reformatter 0 - 0.141504 ms
[TRT]  layer conv1/7x7_s2 + conv1/relu_7x7 - 0.898880 ms
[TRT]  layer pool1/3x3_s2 - 0.170240 ms
[TRT]  layer pool1/norm1 input reformatter 0 - 0.071424 ms
[TRT]  layer pool1/norm1 - 0.127328 ms
[TRT]  layer conv2/3x3_reduce + conv2/relu_3x3_reduce input reformatter 0 - 0.082592 ms
[TRT]  layer conv2/3x3_reduce + conv2/relu_3x3_reduce - 0.105472 ms
[TRT]  layer conv2/3x3 + conv2/relu_3x3 - 0.948544 ms
[TRT]  layer conv2/norm2 input reformatter 0 - 0.191616 ms
[TRT]  layer conv2/norm2 - 0.349952 ms
[TRT]  layer pool2/3x3_s2 input reformatter 0 - 0.227488 ms
[TRT]  layer pool2/3x3_s2 - 0.134816 ms
[TRT]  layer inception_3a/1x1 + inception_3a/relu_1x1 || inception_3a/3x3_reduce + inception_3a/relu_3x3_reduce || inception_3a/5x5_reduce + inception_3a/relu_5x5_reduce - 0.131072 ms
[TRT]  layer inception_3a/3x3 + inception_3a/relu_3x3 - 0.234880 ms
[TRT]  layer inception_3a/5x5 + inception_3a/relu_5x5 - 0.116352 ms
[TRT]  layer inception_3a/pool - 0.091136 ms
[TRT]  layer inception_3a/pool_proj + inception_3a/relu_pool_proj - 0.048224 ms
[TRT]  layer inception_3a/1x1 copy - 0.043936 ms
[TRT]  layer inception_3b/1x1 + inception_3b/relu_1x1 || inception_3b/3x3_reduce + inception_3b/relu_3x3_reduce || inception_3b/5x5_reduce + inception_3b/relu_5x5_reduce - 0.246176 ms
[TRT]  layer inception_3b/3x3 + inception_3b/relu_3x3 - 0.456288 ms
[TRT]  layer inception_3b/5x5 + inception_3b/relu_5x5 - 0.196928 ms
[TRT]  layer inception_3b/pool - 0.124672 ms
[TRT]  layer inception_3b/pool_proj + inception_3b/relu_pool_proj - 0.081856 ms
[TRT]  layer inception_3b/1x1 copy - 0.038912 ms
[TRT]  layer pool3/3x3_s2 - 0.457728 ms
[TRT]  layer inception_4a/1x1 + inception_4a/relu_1x1 || inception_4a/3x3_reduce + inception_4a/relu_3x3_reduce || inception_4a/5x5_reduce + inception_4a/relu_5x5_reduce - 0.115712 ms
[TRT]  layer inception_4a/3x3 + inception_4a/relu_3x3 - 0.115200 ms
[TRT]  layer inception_4a/5x5 + inception_4a/relu_5x5 - 0.045920 ms
[TRT]  layer inception_4a/pool - 0.069408 ms
[TRT]  layer inception_4a/pool_proj + inception_4a/relu_pool_proj - 0.038784 ms
[TRT]  layer inception_4a/1x1 copy - 0.016640 ms
[TRT]  layer inception_4b/1x1 + inception_4b/relu_1x1 || inception_4b/3x3_reduce + inception_4b/relu_3x3_reduce || inception_4b/5x5_reduce + inception_4b/relu_5x5_reduce - 0.113408 ms
[TRT]  layer inception_4b/3x3 + inception_4b/relu_3x3 - 0.146432 ms
[TRT]  layer inception_4b/5x5 + inception_4b/relu_5x5 - 0.039936 ms
[TRT]  layer inception_4b/pool - 0.072032 ms
[TRT]  layer inception_4b/pool_proj + inception_4b/relu_pool_proj - 0.048000 ms
[TRT]  layer inception_4b/1x1 copy - 0.015392 ms
[TRT]  layer inception_4c/1x1 + inception_4c/relu_1x1 || inception_4c/3x3_reduce + inception_4c/relu_3x3_reduce || inception_4c/5x5_reduce + inception_4c/relu_5x5_reduce - 0.116544 ms
[TRT]  layer inception_4c/3x3 + inception_4c/relu_3x3 - 0.146688 ms
[TRT]  layer inception_4c/5x5 + inception_4c/relu_5x5 - 0.044736 ms
[TRT]  layer inception_4c/pool - 0.073728 ms
[TRT]  layer inception_4c/pool_proj + inception_4c/relu_pool_proj - 0.044576 ms
[TRT]  layer inception_4c/1x1 copy - 0.013792 ms
[TRT]  layer inception_4d/1x1 + inception_4d/relu_1x1 || inception_4d/3x3_reduce + inception_4d/relu_3x3_reduce || inception_4d/5x5_reduce + inception_4d/relu_5x5_reduce - 0.114944 ms
[TRT]  layer inception_4d/3x3 + inception_4d/relu_3x3 - 0.254720 ms
[TRT]  layer inception_4d/5x5 + inception_4d/relu_5x5 - 0.043008 ms
[TRT]  layer inception_4d/pool - 0.073728 ms
[TRT]  layer inception_4d/pool_proj + inception_4d/relu_pool_proj - 0.045184 ms
[TRT]  layer inception_4d/1x1 copy - 0.013440 ms
[TRT]  layer inception_4e/1x1 + inception_4e/relu_1x1 || inception_4e/3x3_reduce + inception_4e/relu_3x3_reduce || inception_4e/5x5_reduce + inception_4e/relu_5x5_reduce - 0.162784 ms
[TRT]  layer inception_4e/3x3 + inception_4e/relu_3x3 - 0.246560 ms
[TRT]  layer inception_4e/5x5 + inception_4e/relu_5x5 - 0.053664 ms
[TRT]  layer inception_4e/pool - 0.076832 ms
[TRT]  layer inception_4e/pool_proj + inception_4e/relu_pool_proj - 0.045152 ms
[TRT]  layer inception_4e/1x1 copy - 0.020032 ms
[TRT]  layer inception_5a/1x1 + inception_5a/relu_1x1 || inception_5a/3x3_reduce + inception_5a/relu_3x3_reduce || inception_5a/5x5_reduce + inception_5a/relu_5x5_reduce - 0.227264 ms
[TRT]  layer inception_5a/3x3 + inception_5a/relu_3x3 - 0.246752 ms
[TRT]  layer inception_5a/5x5 + inception_5a/relu_5x5 - 0.053600 ms
[TRT]  layer inception_5a/pool - 0.115392 ms
[TRT]  layer inception_5a/pool_proj + inception_5a/relu_pool_proj - 0.064800 ms
[TRT]  layer inception_5a/1x1 copy - 0.019360 ms
[TRT]  layer inception_5b/1x1 + inception_5b/relu_1x1 || inception_5b/3x3_reduce + inception_5b/relu_3x3_reduce || inception_5b/5x5_reduce + inception_5b/relu_5x5_reduce - 0.290592 ms
[TRT]  layer inception_5b/3x3 + inception_5b/relu_3x3 - 0.302112 ms
[TRT]  layer inception_5b/5x5 + inception_5b/relu_5x5 - 0.099296 ms
[TRT]  layer inception_5b/pool - 0.115200 ms
[TRT]  layer inception_5b/pool_proj + inception_5b/relu_pool_proj - 0.063616 ms
[TRT]  layer inception_5b/1x1 copy - 0.028064 ms
[TRT]  layer cvg/classifier - 0.049120 ms
[TRT]  layer coverage/sig input reformatter 0 - 0.004480 ms
[TRT]  layer coverage/sig - 0.006048 ms
[TRT]  layer bbox/regressor - 0.068384 ms
[TRT]  layer bbox/regressor output reformatter 0 - 0.004384 ms
[TRT]  layer network time - 10.079423 ms
detectnet-console:  finished processing network  (1553239072920)
5 bounding boxes detected
detected obj 0  class #0 (class #0)  confidence=0.992481
bounding box 0  (437.663574, 257.960938)  (822.801514, 498.476562)  w=385.137939  h=240.515625
detected obj 1  class #0 (class #0)  confidence=0.787441
bounding box 1  (1565.563965, 317.078125)  (2028.589966, 875.656250)  w=463.026001  h=558.578125
detected obj 2  class #0 (class #0)  confidence=0.887009
bounding box 2  (975.426025, 372.353516)  (1202.887085, 710.609375)  w=227.461060  h=338.255859
detected obj 3  class #0 (class #0)  confidence=0.660282
bounding box 3  (52.094173, 381.800781)  (191.443420, 567.218750)  w=139.349243  h=185.417969
detected obj 4  class #0 (class #0)  confidence=0.970521
bounding box 4  (216.730774, 393.080078)  (399.695068, 510.371094)  w=182.964294  h=117.291016
detectnet-console:  writing 2049x1120 image to 'output_0.jpg'
detectnet-console:  successfully wrote 2049x1120 image to 'output_0.jpg'

shutting down...

DLA:

nvidia@jetson-0423518027970:~/jetson-inference/build/aarch64/bin$ ./detectnet-console dog_1.jpg output_1.jpg coco-dog
detectnet-console
  args (4):  0 [./detectnet-console]  1 [dog_1.jpg]  2 [output_1.jpg]  3 [coco-dog]  


detectNet -- loading detection network model from:
          -- prototxt     networks/DetectNet-COCO-Dog/deploy.prototxt
          -- model        networks/DetectNet-COCO-Dog/snapshot_iter_38600.caffemodel
          -- input_blob   'data'
          -- output_cvg   'coverage'
          -- output_bbox  'bboxes'
          -- mean_pixel   0.000000
          -- class_labels networks/DetectNet-COCO-Dog/class_labels.txt
          -- threshold    0.500000
          -- batch_size   1

[TRT]  TensorRT version 5.0.3
[TRT]  detected model format - caffe  (extension '.caffemodel')
[TRT]  desired precision specified for DLA_0: FASTEST
[TRT]  requested fasted precision for device DLA_0 without providing valid calibrator, disabling INT8
[TRT]  native precisions detected for DLA_0:  FP32, FP16, INT8
[TRT]  selecting fastest native precision for DLA_0:  FP16
[TRT]  attempting to open engine cache file networks/DetectNet-COCO-Dog/snapshot_iter_38600.caffemodel.1.1.DLA_0.FP16.engine
[TRT]  loading network profile from engine cache... networks/DetectNet-COCO-Dog/snapshot_iter_38600.caffemodel.1.1.DLA_0.FP16.engine
[TRT]  device DLA_0, networks/DetectNet-COCO-Dog/snapshot_iter_38600.caffemodel loaded
[TRT]  device DLA_0, enabling DLA core 0
[TRT]  device DLA_0, CUDA engine context initialized with 3 bindings
[TRT]  binding -- index   0
               -- name    'data'
               -- type    FP32
               -- in/out  INPUT
               -- # dims  3
               -- dim #0  3 (CHANNEL)
               -- dim #1  640 (SPATIAL)
               -- dim #2  640 (SPATIAL)
[TRT]  binding -- index   1
               -- name    'coverage'
               -- type    FP32
               -- in/out  OUTPUT
               -- # dims  3
               -- dim #0  1 (CHANNEL)
               -- dim #1  40 (SPATIAL)
               -- dim #2  40 (SPATIAL)
[TRT]  binding -- index   2
               -- name    'bboxes'
               -- type    FP32
               -- in/out  OUTPUT
               -- # dims  3
               -- dim #0  4 (CHANNEL)
               -- dim #1  40 (SPATIAL)
               -- dim #2  40 (SPATIAL)
[TRT]  binding to input 0 data  binding index:  0
[TRT]  binding to input 0 data  dims (b=1 c=3 h=640 w=640) size=4915200
[cuda]  cudaAllocMapped 4915200 bytes, CPU 0x21aff9000 GPU 0x21aff9000
[TRT]  binding to output 0 coverage  binding index:  1
[TRT]  binding to output 0 coverage  dims (b=1 c=1 h=40 w=40) size=6400
[cuda]  cudaAllocMapped 6400 bytes, CPU 0x21b4a9000 GPU 0x21b4a9000
[TRT]  binding to output 1 bboxes  binding index:  2
[TRT]  binding to output 1 bboxes  dims (b=1 c=4 h=40 w=40) size=25600
[cuda]  cudaAllocMapped 25600 bytes, CPU 0x21b6a9000 GPU 0x21b6a9000
device DLA_0, networks/DetectNet-COCO-Dog/snapshot_iter_38600.caffemodel initialized.
[cuda]  cudaAllocMapped 16 bytes, CPU 0x216b79200 GPU 0x216b79200
detectNet -- model has 1 object classes
detectNet -- failed to find networks/DetectNet-COCO-Dog/class_labels.txt
detectNet -- maximum bounding boxes:  6400
[cuda]  cudaAllocMapped 102400 bytes, CPU 0x21b8a9000 GPU 0x21b8a9000
[cuda]  cudaAllocMapped 25600 bytes, CPU 0x21b6af400 GPU 0x21b6af400
loaded image  dog_1.jpg  (1920 x 1080)  33177600 bytes
[cuda]  cudaAllocMapped 33177600 bytes, CPU 0x21baa9000 GPU 0x21baa9000
detectnet-console:  beginning processing network (1553242030651)
[TRT]  layer data to nvm - 0.566592 ms
[TRT]  layer {deploy_transform,conv1/7x7_s2,conv1/relu_7x7,pool1/3x3_s2,pool1/norm1,conv2/3x3_reduce,conv2/relu_3x3_reduce,conv2/3x3,conv2/relu_3x3,conv2/norm2,pool2/3x3_s2,inception_3a/1x1,inception_3a/relu_1x1,inception_3a/3x3_reduce,inception_3a/relu_3x3_reduce,inception_3a/3x3,inception_3a/relu_3x3,inception_3a/5x5_reduce,inception_3a/relu_5x5_reduce,inception_3a/5x5,inception_3a/relu_5x5,inception_3a/pool,inception_3a/pool_proj,inception_3a/relu_pool_proj,inception_3a/output,inception_3b/1x1,inception_3b/relu_1x1,inception_3b/3x3_reduce,inception_3b/relu_3x3_reduce,inception_3b/3x3,inception_3b/relu_3x3,inception_3b/5x5_reduce,inception_3b/relu_5x5_reduce,inception_3b/5x5,inception_3b/relu_5x5,inception_3b/pool,inception_3b/pool_proj,inception_3b/relu_pool_proj,inception_3b/output,pool3/3x3_s2,inception_4a/1x1,inception_4a/relu_1x1,inception_4a/3x3_reduce,inception_4a/relu_3x3_reduce,inception_4a/3x3,inception_4a/relu_3x3,inception_4a/5x5_reduce,inception_4a/relu_5x5_reduce,inception_4a/5x5,inception_4a/relu_5x5,inception_4a/pool,inception_4a/pool_proj,inception_4a/relu_pool_proj,inception_4a/output,inception_4b/1x1,inception_4b/relu_1x1,inception_4b/3x3_reduce,inception_4b/relu_3x3_reduce,inception_4b/3x3,inception_4b/relu_3x3,inception_4b/5x5_reduce,inception_4b/relu_5x5_reduce,inception_4b/5x5,inception_4b/relu_5x5,inception_4b/pool,inception_4b/pool_proj,inception_4b/relu_pool_proj,inception_4b/output,inception_4c/1x1,inception_4c/relu_1x1,inception_4c/3x3_reduce,inception_4c/relu_3x3_reduce,inception_4c/3x3,inception_4c/relu_3x3,inception_4c/5x5_reduce,inception_4c/relu_5x5_reduce,inception_4c/5x5,inception_4c/relu_5x5,inception_4c/pool,inception_4c/pool_proj,inception_4c/relu_pool_proj,inception_4c/output,inception_4d/1x1,inception_4d/relu_1x1,inception_4d/3x3_reduce,inception_4d/relu_3x3_reduce,inception_4d/3x3,inception_4d/relu_3x3,inception_4d/5x5_reduce,inception_4d/relu_5x5_reduce,inception_4d/5x5,inception_4d/relu_5x5,inception_4d/pool,inception_4d/pool_proj,inception_4d/relu_pool_proj,inception_4d/output,inception_4e/1x1,inception_4e/relu_1x1,inception_4e/3x3_reduce,inception_4e/relu_3x3_reduce,inception_4e/3x3,inception_4e/relu_3x3,inception_4e/5x5_reduce,inception_4e/relu_5x5_reduce,inception_4e/5x5,inception_4e/relu_5x5,inception_4e/pool,inception_4e/pool_proj,inception_4e/relu_pool_proj,inception_4e/output,inception_5a/1x1,inception_5a/relu_1x1,inception_5a/3x3_reduce,inception_5a/relu_3x3_reduce,inception_5a/3x3,inception_5a/relu_3x3,inception_5a/5x5_reduce,inception_5a/relu_5x5_reduce,inception_5a/5x5,inception_5a/relu_5x5,inception_5a/pool,inception_5a/pool_proj,inception_5a/relu_pool_proj,inception_5a/output,inception_5b/1x1,inception_5b/relu_1x1,inception_5b/3x3_reduce,inception_5b/relu_3x3_reduce,inception_5b/3x3,inception_5b/relu_3x3,inception_5b/5x5_reduce,inception_5b/relu_5x5_reduce,inception_5b/5x5,inception_5b/relu_5x5,inception_5b/pool,inception_5b/pool_proj,inception_5b/relu_pool_proj,inception_5b/output,cvg/classifier,coverage/sig,bbox/regressor} - 1.680768 ms
[TRT]  layer data copy finish - 0.074016 ms
[TRT]  layer bboxes from nvm - 41.842625 ms
[TRT]  layer bboxes copy finish - 0.003488 ms
[TRT]  layer coverage from nvm - 0.006144 ms
[TRT]  layer coverage copy finish - 0.002048 ms
[TRT]  layer network time - 44.175682 ms
detectnet-console:  finished processing network  (1553242030698)
2 bounding boxes detected
detected obj 0  class #0 (class #0)  confidence=0.873047
bounding box 0  (1265.812500, 261.562500)  (1670.812500, 541.792969)  w=405.000000  h=280.230469
detected obj 1  class #0 (class #0)  confidence=0.695312
bounding box 1  (622.687500, 336.155273)  (998.718750, 563.466797)  w=376.031250  h=227.311523
detectnet-console:  writing 1920x1080 image to 'output_1.jpg'
detectnet-console:  successfully wrote 1920x1080 image to 'output_1.jpg'

shutting down...

On line 74 of the DLA output: [TRT] layer bboxes from nvm - 41.842625 ms
This layer seems to account for most of the inference time, but I can’t understand why.
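
For anyone who wants to check this breakdown without reading the log by eye, here is a rough Python sketch that parses the "[TRT]  layer <name> - <time> ms" lines and reports which entry dominates. The function names are hypothetical, and the long fused DLA layer name is abbreviated with "..." in the sample; the timings are copied from the DLA run above.

```python
import re

# Matches lines such as "[TRT]  layer bboxes from nvm - 41.842625 ms".
LAYER_RE = re.compile(r"^\[TRT\]\s+layer (.+) - ([0-9.]+) ms$")

def layer_times(log_text):
    """Return (name, milliseconds) pairs for each per-layer timing line."""
    times = []
    for line in log_text.splitlines():
        match = LAYER_RE.match(line.strip())
        # Skip the summary line so it is not counted as a layer.
        if match and match.group(1) != "network time":
            times.append((match.group(1), float(match.group(2))))
    return times

def slowest_layer(log_text):
    """Return (name, ms, share of summed layer time) for the slowest entry."""
    times = layer_times(log_text)
    total = sum(ms for _, ms in times)
    name, ms = max(times, key=lambda item: item[1])
    return name, ms, ms / total

# Timing lines from the DLA run; the fused layer name is shortened here.
DLA_LOG = """\
[TRT]  layer data to nvm - 0.566592 ms
[TRT]  layer {deploy_transform,...,bbox/regressor} - 1.680768 ms
[TRT]  layer data copy finish - 0.074016 ms
[TRT]  layer bboxes from nvm - 41.842625 ms
[TRT]  layer bboxes copy finish - 0.003488 ms
[TRT]  layer coverage from nvm - 0.006144 ms
[TRT]  layer coverage copy finish - 0.002048 ms
[TRT]  layer network time - 44.175682 ms
"""

name, ms, share = slowest_layer(DLA_LOG)
print(f"{name}: {ms:.3f} ms ({share:.1%} of layer time)")
```

On this log the "bboxes from nvm" entry comes out at roughly 95% of the total, while the fused network itself takes under 2 ms. The same function can be pointed at the GPU log for comparison.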

Hi,

Could you first generate the DLA log with the TensorRT trtexec sample app?

$ cp -r /usr/src/tensorrt/ .
$ cd tensorrt/samples/trtexec/
$ make
$ cd ../../bin/
$ ./trtexec --deploy=[your model]
$ ./trtexec --deploy=[your model] --useDLACore=0 --allowGPUFallback

Thanks.