Cannot make TensorRT work on DLA (Jetson Xavier)

(Jetson Xavier, TRT7)

I’m trying to use the DLA engines on Jetson Xavier.
I have a reduced version of MobileNet (reduced = a few layers removed) as a UFF file.

When I use trtexec to convert the UFF file and build an engine for the GPU, and then load that engine, everything works.
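
For reference, the GPU-only engine was built with essentially the same trtexec invocation as the DLA one below, just without the --useDLACore flag (reconstructed here for completeness; the output engine name is a placeholder, not the exact command I ran):

/usr/src/tensorrt/bin/trtexec --avgRuns=10 --uff=./mobilenet_v1_1.0_224_353i.uff --fp16 --batch=16 --iterations=100 --uffInput=input,3,224,224 --output=MobilenetV1/Logits/Dropout_1b/Identity --workspace=1024 --saveEngine=./mobilenet_GPU.engine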

Now, I tried to convert it to be used on the DLA.
I issued:

/usr/src/tensorrt/bin/trtexec --avgRuns=10 --uff=./mobilenet_v1_1.0_224_353i.uff --fp16 --batch=16 --iterations=100 --uffInput=input,3,224,224 --output=MobilenetV1/Logits/Dropout_1b/Identity --workspace=1024 --saveEngine=./mobilenet_DLA_0.engine --useDLACore=0

and it seems ok:

[11/17/2020-23:45:20] [I] === System Options ===
[11/17/2020-23:45:20] [I] Device: 0
[11/17/2020-23:45:20] [I] DLACore: 0
[11/17/2020-23:45:20] [I] Plugins:
[11/17/2020-23:45:20] [I] === Inference Options ===
[11/17/2020-23:45:20] [I] Batch: 16
[11/17/2020-23:45:20] [I] Input inference shapes: model
[11/17/2020-23:45:20] [I] Iterations: 100
[11/17/2020-23:45:20] [I] Duration: 3s (+ 200ms warm up)
[11/17/2020-23:45:20] [I] Sleep time: 0ms
[11/17/2020-23:45:20] [I] Streams: 1
[11/17/2020-23:45:20] [I] ExposeDMA: Disabled
[11/17/2020-23:45:20] [I] Spin-wait: Disabled
[11/17/2020-23:45:20] [I] Multithreading: Disabled
[11/17/2020-23:45:20] [I] CUDA Graph: Disabled
[11/17/2020-23:45:20] [I] Skip inference: Disabled
[11/17/2020-23:45:20] [I] Inputs:
[11/17/2020-23:45:20] [I] === Reporting Options ===
[11/17/2020-23:45:20] [I] Verbose: Disabled
[11/17/2020-23:45:20] [I] Averages: 10 inferences
[11/17/2020-23:45:20] [I] Percentile: 99
[11/17/2020-23:45:20] [I] Dump output: Disabled
[11/17/2020-23:45:20] [I] Profile: Disabled
[11/17/2020-23:45:20] [I] Export timing to JSON file:
[11/17/2020-23:45:20] [I] Export output to JSON file:
[11/17/2020-23:45:20] [I] Export profile to JSON file:
[11/17/2020-23:45:20] [I]
[11/17/2020-23:45:21] [I] [TRT]
[11/17/2020-23:45:21] [I] [TRT] --------------- Layers running on DLA:
[11/17/2020-23:45:21] [I] [TRT] {MobilenetV1/MobilenetV1/Conv2d_0/Conv2D,MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_0/Relu6,MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_1_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_1_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_1_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_2_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_2_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_2_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_2_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_3_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_3_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_3_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_4_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_4_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_4_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_4_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_4_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_4_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_5_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_5_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_5_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_5_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_6_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_6_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_6_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_6_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_6_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_6_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_7_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_7_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_7_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_7_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_8_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_8_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_8_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_8_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_9_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_9_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_9_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_9_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_10_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_10_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_10_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_10_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_11_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_11_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_11_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_11_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_12_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_12_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_12_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_12_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_12_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_12_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_13_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_13_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_13_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_13_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6,MobilenetV1/Logits/AvgPool_1a/AvgPool},
[11/17/2020-23:45:21] [I] [TRT] --------------- Layers running on GPU:
[11/17/2020-23:45:21] [I] [TRT] MobilenetV1/Logits/Dropout_1b/Identity_HL_1804289383, MobilenetV1/Logits/Dropout_1b/Identity,
[11/17/2020-23:45:33] [W] [TRT] No implementation obeys reformatting-free rules, at least 1 reformatting nodes are needed, now picking the fastest path instead.
[11/17/2020-23:45:34] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[11/17/2020-23:45:38] [I] Starting inference threads
[11/17/2020-23:45:43] [I] Warmup completed 80 queries over 200 ms
[11/17/2020-23:45:43] [I] Timing trace has 1600 queries over 4.83661 s
[11/17/2020-23:45:43] [I] Trace averages of 10 runs:
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 47.9298 ms - Host latency: 48.1849 ms (end to end 48.1943 ms, enqueue 1.49098 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 48.0195 ms - Host latency: 48.2748 ms (end to end 48.2837 ms, enqueue 1.51106 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 48.2594 ms - Host latency: 48.515 ms (end to end 48.5413 ms, enqueue 1.70781 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 48.2031 ms - Host latency: 48.4689 ms (end to end 48.5072 ms, enqueue 1.66277 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 48.1273 ms - Host latency: 48.3823 ms (end to end 48.3895 ms, enqueue 1.58535 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 48.1673 ms - Host latency: 48.4382 ms (end to end 48.4783 ms, enqueue 1.5334 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 48.2717 ms - Host latency: 48.5269 ms (end to end 48.5367 ms, enqueue 1.53513 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 48.1646 ms - Host latency: 48.4327 ms (end to end 48.4643 ms, enqueue 1.65442 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 47.8747 ms - Host latency: 48.1302 ms (end to end 48.1389 ms, enqueue 1.43381 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 47.8593 ms - Host latency: 48.1168 ms (end to end 48.1258 ms, enqueue 1.34902 ms)
[11/17/2020-23:45:43] [I] Host Latency
[11/17/2020-23:45:43] [I] min: 48.0854 ms (end to end 48.0977 ms)
[11/17/2020-23:45:43] [I] max: 48.9292 ms (end to end 48.9419 ms)
[11/17/2020-23:45:43] [I] mean: 48.3471 ms (end to end 48.366 ms)
[11/17/2020-23:45:43] [I] median: 48.349 ms (end to end 48.3591 ms)
[11/17/2020-23:45:43] [I] percentile: 48.9292 ms at 99% (end to end 48.9419 ms at 99%)
[11/17/2020-23:45:43] [I] throughput: 330.811 qps
[11/17/2020-23:45:43] [I] walltime: 4.83661 s

I also noticed it’s ~6 times slower than running on the GPU only, which is consistent with the numbers published in the tutorials.

BUT

When I try to run it in my program, I get DLA errors.
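
For context, this is roughly how my program loads the engine (a simplified sketch, not my exact code; the Logger class is a stand-in for my real ILogger implementation). The runtime is pointed at DLA core 0 before deserializing, matching the core the engine was built for:

#include <NvInfer.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Stand-in for the real ILogger implementation.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} gLogger;

int main()
{
    // Read the serialized engine produced by trtexec.
    std::ifstream file("./mobilenet_DLA_0.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    // The engine was built for DLA core 0, so select the same core
    // on the runtime before deserializing.
    runtime->setDLACore(0);

    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
    if (!engine)
    {
        std::cerr << "engine deserialization failed" << std::endl;
        return 1;
    }

    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    // ... allocate input/output buffers and run inference here ...

    context->destroy();
    engine->destroy();
    runtime->destroy();
    return 0;
}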

So, I tried to load the engine I just created, again via trtexec:

/usr/src/tensorrt/bin/trtexec --fp16 --iterations=100 --loadEngine=./mobilenet_DLA_0.engine --useDLACore=0 --allowGPUFallback=enabled

and got similar errors:

NVMEDIA_DLA :  885, ERROR: runtime registerEvent failed. err: 0x4.
NVMEDIA_DLA : 1849, ERROR: RequestSubmitEvents failed. status: 0x7.
[11/17/2020-23:56:14] [E] [TRT] ../rtExt/dla/native/dlaUtils.cpp (194) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
[11/17/2020-23:56:14] [E] [TRT] FAILED_EXECUTION: std::exception

Any ideas?

Thanks for the help!

Moving to Jetson AGX Xavier forum for resolution.

Hi,

Could you check whether the same error occurs when loading the engine with trtexec using the --loadEngine flag?
If so, please share the log with --verbose enabled; an example command is below.
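
For example, reusing the engine path and DLA core from above:

/usr/src/tensorrt/bin/trtexec --loadEngine=./mobilenet_DLA_0.engine --useDLACore=0 --verbose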

Thanks.