Cannot make TensorRT work on DLA (Jetson Xavier)

(Jetson Xavier, TRT7)

I’m trying to use the DLA engines on Jetson Xavier.
I have a reduced version of MobileNet (reduced = a few layers removed) as a UFF file.

When I use trtexec to convert the UFF file and build an engine for the GPU, and then load that engine, everything works.
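
For reference, the GPU-only engine was built with essentially the same trtexec invocation as the DLA one below, just without the --useDLACore flag (reconstructed here for completeness; the output engine name is a placeholder, not the exact command I ran):

/usr/src/tensorrt/bin/trtexec --avgRuns=10 --uff=./mobilenet_v1_1.0_224_353i.uff --fp16 --batch=16 --iterations=100 --uffInput=input,3,224,224 --output=MobilenetV1/Logits/Dropout_1b/Identity --workspace=1024 --saveEngine=./mobilenet_GPU.engine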

Now, I tried to convert it to be used on the DLA.
I issued:

/usr/src/tensorrt/bin/trtexec --avgRuns=10 --uff=./mobilenet_v1_1.0_224_353i.uff --fp16 --batch=16 --iterations=100 --uffInput=input,3,224,224 --output=MobilenetV1/Logits/Dropout_1b/Identity --workspace=1024 --saveEngine=./mobilenet_DLA_0.engine --useDLACore=0

and it seems ok:

[11/17/2020-23:45:20] [I] === System Options ===
[11/17/2020-23:45:20] [I] Device: 0
[11/17/2020-23:45:20] [I] DLACore: 0
[11/17/2020-23:45:20] [I] Plugins:
[11/17/2020-23:45:20] [I] === Inference Options ===
[11/17/2020-23:45:20] [I] Batch: 16
[11/17/2020-23:45:20] [I] Input inference shapes: model
[11/17/2020-23:45:20] [I] Iterations: 100
[11/17/2020-23:45:20] [I] Duration: 3s (+ 200ms warm up)
[11/17/2020-23:45:20] [I] Sleep time: 0ms
[11/17/2020-23:45:20] [I] Streams: 1
[11/17/2020-23:45:20] [I] ExposeDMA: Disabled
[11/17/2020-23:45:20] [I] Spin-wait: Disabled
[11/17/2020-23:45:20] [I] Multithreading: Disabled
[11/17/2020-23:45:20] [I] CUDA Graph: Disabled
[11/17/2020-23:45:20] [I] Skip inference: Disabled
[11/17/2020-23:45:20] [I] Inputs:
[11/17/2020-23:45:20] [I] === Reporting Options ===
[11/17/2020-23:45:20] [I] Verbose: Disabled
[11/17/2020-23:45:20] [I] Averages: 10 inferences
[11/17/2020-23:45:20] [I] Percentile: 99
[11/17/2020-23:45:20] [I] Dump output: Disabled
[11/17/2020-23:45:20] [I] Profile: Disabled
[11/17/2020-23:45:20] [I] Export timing to JSON file:
[11/17/2020-23:45:20] [I] Export output to JSON file:
[11/17/2020-23:45:20] [I] Export profile to JSON file:
[11/17/2020-23:45:20] [I]
[11/17/2020-23:45:21] [I] [TRT]
[11/17/2020-23:45:21] [I] [TRT] --------------- Layers running on DLA:
[11/17/2020-23:45:21] [I] [TRT] {MobilenetV1/MobilenetV1/Conv2d_0/Conv2D,MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_0/Relu6,MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_1_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_1_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_1_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_2_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_2_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_2_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_2_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_3_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_3_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_3_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_4_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_4_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_4_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_4_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_4_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_4_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_5_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_5_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_5_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_5_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_6_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_6_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_6_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_6_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_6_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_6_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_7_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_7_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_7_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_7_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_8_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_8_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_8_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_8_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_9_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_9_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_9_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_9_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_10_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_10_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_10_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_10_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_11_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_11_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_11_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_11_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_12_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_12_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_12_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_12_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_12_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_12_pointwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_13_depthwise/depthwise,MobilenetV1/MobilenetV1/Conv2d_13_depthwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_13_depthwise/Relu6,MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Conv2D,MobilenetV1/MobilenetV1/Conv2d_13_pointwise/BatchNorm/FusedBatchNorm,MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6,MobilenetV1/Logits/AvgPool_1a/AvgPool},
[11/17/2020-23:45:21] [I] [TRT] --------------- Layers running on GPU:
[11/17/2020-23:45:21] [I] [TRT] MobilenetV1/Logits/Dropout_1b/Identity_HL_1804289383, MobilenetV1/Logits/Dropout_1b/Identity,
[11/17/2020-23:45:33] [W] [TRT] No implementation obeys reformatting-free rules, at least 1 reformatting nodes are needed, now picking the fastest path instead.
[11/17/2020-23:45:34] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[11/17/2020-23:45:38] [I] Starting inference threads
[11/17/2020-23:45:43] [I] Warmup completed 80 queries over 200 ms
[11/17/2020-23:45:43] [I] Timing trace has 1600 queries over 4.83661 s
[11/17/2020-23:45:43] [I] Trace averages of 10 runs:
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 47.9298 ms - Host latency: 48.1849 ms (end to end 48.1943 ms, enqueue 1.49098 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 48.0195 ms - Host latency: 48.2748 ms (end to end 48.2837 ms, enqueue 1.51106 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 48.2594 ms - Host latency: 48.515 ms (end to end 48.5413 ms, enqueue 1.70781 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 48.2031 ms - Host latency: 48.4689 ms (end to end 48.5072 ms, enqueue 1.66277 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 48.1273 ms - Host latency: 48.3823 ms (end to end 48.3895 ms, enqueue 1.58535 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 48.1673 ms - Host latency: 48.4382 ms (end to end 48.4783 ms, enqueue 1.5334 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 48.2717 ms - Host latency: 48.5269 ms (end to end 48.5367 ms, enqueue 1.53513 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 48.1646 ms - Host latency: 48.4327 ms (end to end 48.4643 ms, enqueue 1.65442 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 47.8747 ms - Host latency: 48.1302 ms (end to end 48.1389 ms, enqueue 1.43381 ms)
[11/17/2020-23:45:43] [I] Average on 10 runs - GPU latency: 47.8593 ms - Host latency: 48.1168 ms (end to end 48.1258 ms, enqueue 1.34902 ms)
[11/17/2020-23:45:43] [I] Host Latency
[11/17/2020-23:45:43] [I] min: 48.0854 ms (end to end 48.0977 ms)
[11/17/2020-23:45:43] [I] max: 48.9292 ms (end to end 48.9419 ms)
[11/17/2020-23:45:43] [I] mean: 48.3471 ms (end to end 48.366 ms)
[11/17/2020-23:45:43] [I] median: 48.349 ms (end to end 48.3591 ms)
[11/17/2020-23:45:43] [I] percentile: 48.9292 ms at 99% (end to end 48.9419 ms at 99%)
[11/17/2020-23:45:43] [I] throughput: 330.811 qps
[11/17/2020-23:45:43] [I] walltime: 4.83661 s

I also noticed it’s ~6 times slower than running on the GPU only, which is consistent with the numbers published in the tutorials.

BUT

When I try to run it in my program, I get DLA errors.
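
For context, this is roughly how my program loads the engine (a simplified sketch, not my exact code; the Logger class is a stand-in for my real ILogger implementation). The runtime is pointed at DLA core 0 before deserializing, matching the core the engine was built for:

#include <NvInfer.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Stand-in for the real ILogger implementation.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} gLogger;

int main()
{
    // Read the serialized engine produced by trtexec.
    std::ifstream file("./mobilenet_DLA_0.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    // The engine was built for DLA core 0, so select the same core
    // on the runtime before deserializing.
    runtime->setDLACore(0);

    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
    if (!engine)
    {
        std::cerr << "engine deserialization failed" << std::endl;
        return 1;
    }

    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    // ... allocate input/output buffers and run inference here ...

    context->destroy();
    engine->destroy();
    runtime->destroy();
    return 0;
}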

So, I tried to load the engine I just created, again via trtexec:

/usr/src/tensorrt/bin/trtexec --fp16 --iterations=100 --loadEngine=./mobilenet_DLA_0.engine --useDLACore=0 --allowGPUFallback=enabled

and got similar errors:

NVMEDIA_DLA :  885, ERROR: runtime registerEvent failed. err: 0x4.
NVMEDIA_DLA : 1849, ERROR: RequestSubmitEvents failed. status: 0x7.
[11/17/2020-23:56:14] [E] [TRT] ../rtExt/dla/native/dlaUtils.cpp (194) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
[11/17/2020-23:56:14] [E] [TRT] FAILED_EXECUTION: std::exception

Any ideas?

Thanks for the help!

Moving to Jetson AGX Xavier forum for resolution.

Hi,

Could you check whether the same error occurs when loading the engine with trtexec using the --loadEngine flag?
If so, please share the log with --verbose enabled; an example command is below.
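
For example, reusing the engine path and DLA core from above:

/usr/src/tensorrt/bin/trtexec --loadEngine=./mobilenet_DLA_0.engine --useDLACore=0 --verbose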

Thanks.