DLA trtexec questions

Hi,

I read about how to use the DLA on this page:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#dla_topic

I am running a trtexec command line similar to what is written on that page, but with my own ONNX file,
and I get the output below. Can anyone explain what it means, and what to expect when running this network on the DLA?

Thanks,
Gabi

./bin/trtexec --onnx=/mnt/nvmedisk/Gabi/Retinanet_ir_3class_resnet50_PT_640x512/trt_ir_3c_b1_640x512.onnx --output=prob --useDLACore=1 --fp16 --allowGPUFallback
&&&& RUNNING TensorRT.trtexec # ./bin/trtexec --onnx=/mnt/nvmedisk/Gabi/Retinanet_ir_3class_resnet50_PT_640x512/trt_ir_3c_b1_640x512.onnx --output=prob --useDLACore=1 --fp16 --allowGPUFallback
[I] onnx: /mnt/nvmedisk/Gabi/Retinanet_ir_3class_resnet50_PT_640x512/trt_ir_3c_b1_640x512.onnx
[I] output: prob
[I] useDLACore: 1
[I] fp16
[I] allowGPUFallback

Input filename: /mnt/nvmedisk/Gabi/Retinanet_ir_3class_resnet50_PT_640x512/trt_ir_3c_b1_640x512.onnx
ONNX IR version: 0.0.4
Opset version: 9
Producer name: pytorch
Producer version: 1.1
Domain:
Model version: 0
Doc string:

WARNING: ONNX model has a newer ir_version (0.0.4) than this parser was built against (0.0.3).
[W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 174) [PluginV2] is not running on DLA, falling back to GPU.
[W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 177) [PluginV2] is not running on DLA, falling back to GPU.
[W] [TRT] DLA LAYER: CBUF size requirement for layer (Unnamed Layer* 179) [Convolution] is 9banks, which exceeds the limit (8).
[W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 179) [Convolution] is not running on DLA, falling back to GPU.
[W] [TRT] DLA supports only 8 subgraphs per DLA core. Switching to GPU for layer (Unnamed Layer* 180) [Activation]
[W] [TRT] DLA supports only 8 subgraphs per DLA core. Switching to GPU for layer (Unnamed Layer* 276) [Activation]
[W] [TRT] DLA supports only 8 subgraphs per DLA core. Switching to GPU for layer (Unnamed Layer* 278) [Activation]
[W] [TRT] DLA supports only 8 subgraphs per DLA core. Switching to GPU for layer (Unnamed Layer* 279) [Activation]
[I] Average over 10 runs is 73.9307 ms (host walltime is 74.6987 ms, 99% percentile time is 76.5665).
[I] Average over 10 runs is 72.0191 ms (host walltime is 72.4281 ms, 99% percentile time is 76.1508).
[I] Average over 10 runs is 71.4415 ms (host walltime is 71.886 ms, 99% percentile time is 73.8089).
[I] Average over 10 runs is 70.9269 ms (host walltime is 71.1898 ms, 99% percentile time is 74.4588).
[I] Average over 10 runs is 71.2038 ms (host walltime is 71.4089 ms, 99% percentile time is 73.0836).
[I] Average over 10 runs is 69.9025 ms (host walltime is 70.118 ms, 99% percentile time is 72.2458).
[I] Average over 10 runs is 70.6782 ms (host walltime is 70.914 ms, 99% percentile time is 72.2416).
[I] Average over 10 runs is 69.9151 ms (host walltime is 70.18 ms, 99% percentile time is 71.1107).
[I] Average over 10 runs is 69.9766 ms (host walltime is 70.2225 ms, 99% percentile time is 70.7799).
[I] Average over 10 runs is 70.4809 ms (host walltime is 70.6907 ms, 99% percentile time is 72.1121).
&&&& PASSED TensorRT.trtexec # ./bin/trtexec --onnx=/mnt/nvmedisk/Gabi/Retinanet_ir_3class_resnet50_PT_640x512/trt_ir_3c_b1_640x512.onnx --output=prob --useDLACore=1 --fp16 --allowGPUFallback

Hi,

You can add --verbose to see in detail how the model is deployed.
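
For example, reusing the command from your post:

./bin/trtexec --onnx=/mnt/nvmedisk/Gabi/Retinanet_ir_3class_resnet50_PT_640x512/trt_ir_3c_b1_640x512.onnx --output=prob --useDLACore=1 --fp16 --allowGPUFallback --verbose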

Based on your log, these are the unsupported layers, which fall back to the GPU:

[W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 174) [PluginV2] is not running on DLA, falling back to GPU.
[W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 177) [PluginV2] is not running on DLA, falling back to GPU.
[W] [TRT] DLA LAYER: CBUF size requirement for layer (Unnamed Layer* 179) [Convolution] is 9banks, which exceeds the limit (8).
[W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 179) [Convolution] is not running on DLA, falling back to GPU.

Some other layers also fall back to the GPU because the DLA is already fully loaded (DLA supports only 8 subgraphs per core).
Thanks.

Hi,

I have some more questions:

  1. When there is a fallback to the GPU, is there a way to measure the time spent on the DLA and the time spent on the GPU?

  2. When also using the option --saveEngine, as in:

./trtexec --onnx=/media/foresight/nvmedisk/Xavier/vis/retinanet_rn50fpn.onnx --useDLACore=0 --fp16 --allowGPUFallback --saveEngine=engine_file

In what format is the engine file created?

Is it the same format as the plan file created by retinanet-examples when running the command below?

./export model.onnx engine.plan

see: https://github.com/NVIDIA/retinanet-examples/tree/master/extras/deepstream

Thanks,
Gabi

Hi,

1. We don’t have a dedicated tool for this.
However, the TensorRT profiler does support layer-level execution time profiling,
so you can still check the performance of the fallback layers directly.
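
As a rough illustration, below is a minimal sketch of attaching a profiler with the TensorRT Python API. It assumes the serialized engine saved as engine_file from your trtexec --saveEngine command and uses pycuda for the device buffers; treat it as a sketch rather than a drop-in script:

import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

class LayerTimer(trt.IProfiler):
    # Collects the per-layer times TensorRT reports after a synchronous run.
    def __init__(self):
        trt.IProfiler.__init__(self)
        self.times = {}

    def report_layer_time(self, layer_name, ms):
        # Called once per layer per inference; accumulate across runs.
        self.times[layer_name] = self.times.get(layer_name, 0.0) + ms

logger = trt.Logger(trt.Logger.WARNING)
with open("engine_file", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
profiler = LayerTimer()
context.profiler = profiler  # callbacks fire on the synchronous execute()

# One device buffer per binding; buffer contents don't matter for timing.
bindings = []
for i in range(engine.num_bindings):
    count = trt.volume(engine.get_binding_shape(i)) * engine.max_batch_size
    dtype = trt.nptype(engine.get_binding_dtype(i))
    bindings.append(int(cuda.mem_alloc(count * np.dtype(dtype).itemsize)))

context.execute(batch_size=1, bindings=bindings)

# Layers that fell back to the GPU show up here next to the DLA subgraphs.
for name, ms in sorted(profiler.times.items(), key=lambda kv: -kv[1]):
    print("%8.3f ms  %s" % (ms, name))

Depending on your TensorRT version, trtexec may also accept a --dumpProfile flag that prints per-layer timings directly.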

2. It is serialized kernel data from TensorRT.
And yes, it is the same format as the one in the link you shared.

But please note that this engine file is very sensitive to the TensorRT version and platform,
so it cannot be used across different TensorRT versions or platforms.
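
As a small illustration, both files from this thread (engine_file from trtexec --saveEngine, engine.plan from retinanet-examples) are loaded with the same runtime call, and deserialization simply fails when the engine was built with a different TensorRT version or on a different platform:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

for path in ("engine_file", "engine.plan"):
    with open(path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    # deserialize_cuda_engine returns None (and logs an error) if the file
    # was built with another TensorRT version or on another platform.
    print(path, "loaded:", engine is not None)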

Thanks.