Issues when using DLA with TensorRT 7.1.3 compared to TensorRT 6.0.1

Description

I have two Jetson AGX Xavier boards with different Jetpack versions.
One is Jetpack 4.3 (with TensorRT 6.0.1) and the other is Jetpack 4.4 (with TensorRT 7.1.3).

I will refer to these two devices as Device A and Device B.

Device A: Jetson AGX Xavier with Jetpack 4.3
Device B: Jetson AGX Xavier with Jetpack 4.4

I ran the same code on the DLA of both devices, but I found some differences.
I set the MAXN nvpmodel and ran jetson_clocks on both devices.

1. Inference time on Device B is longer than on Device A.

When I ran my code on Device A, it took 9.4 seconds, but on Device B it took 11.9 seconds.
I don’t know why Device B took more time.

2. Device B has a limitation on creating multiple contexts

I created multiple contexts as in the following example code.

for (int j = 0; j < NUM_CONTEXTS; j++)
{
    context[j] = engine->createExecutionContext();
    assert(context[j] != nullptr);
}

On Device A, I can create multiple execution contexts, but on Device B an error occurs once a certain number of contexts have been created.
As a result, I can create 8 contexts on Device A, but only 4 contexts on Device B.

The error on Device B looks like the following.

NvMapMemAllocInternalTagged: 1074810371 error 12
NvMapMemHandleAlloc: error 12
NVMEDIA_DLA : 1686, ERROR: runtime loadBare failed. err: 0x6.
…/rtExt/dla/native/dlaUtils.cpp (166) - DLA Error in deserialize: 7 (NvMediaDlaLoadLoadable : load loadable failed.)
FAILED_ALLOCATION: std::exception

Another difference is that a createExecutionContext() call on Device B takes longer than on Device A. On Device B it takes about a second to create each context, whereas Device A creates multiple execution contexts almost immediately.

Do you know why these issues happen on Jetpack 4.4 with TensorRT 7.1.3?

Hi @chjej202
Jetson team should be able to help you better here.
Thanks!

Hi,

Would you mind sharing your model and the source that can reproduce this issue?
Is this reproducible with trtexec?

Thanks.

Hi,

I tried with trtexec and found that this issue is reproducible.

I ran the following command:

user@nvidia:/usr/src/tensorrt/data/resnet50$ ../../bin/trtexec --avgRuns=300 --deploy=ResNet50_N2.prototxt --fp16 --batch=1 --iterations=300 --output=prob --useDLACore=0 --useSpinWait --allowGPUFallback --streams=8

By changing the --streams option, you can change the number of execution contexts that are created.

On Device A (Jetson AGX Xavier with Jetpack 4.3), no error occurs and it shows the following execution-time results.

“Average over 300 runs is 6.04887 ms (host walltime is 6.08014 ms, 99% percentile time is 6.21517).”

On Device B (Jetson AGX Xavier with Jetpack 4.4), it fails with the following errors on the screen.

NvMapMemAllocInternalTagged: 1074810371 error 12
NvMapMemHandleAlloc: error 12
NVMEDIA_DLA : 1686, ERROR: runtime loadBare failed. err: 0x6.
[08/21/2020-19:23:30] [E] [TRT] …/rtExt/dla/native/dlaUtils.cpp (166) - DLA Error in deserialize: 7 (NvMediaDlaLoadLoadable : load loadable failed.)
[08/21/2020-19:23:30] [E] [TRT] FAILED_ALLOCATION: std::exception

The above error messages are shown 4 times, which means that 4 execution contexts failed to be created while the other 4 were created properly.

Also, the execution time shown on the screen was the following, which is longer than on Device A.

Average on 300 runs - GPU latency: 7.01389 ms - Host latency: 7.03811 ms (end to end 7.04715 ms, enqueue 0.338025 ms)

Hi,

Thanks for the report.
We can reproduce this issue in our environment.
Will check this with our internal team and update more information with you later.

There is a 1 GiB memory limit for DLA intermediate tensor data.
This error may be hitting that limit, but we need to check with our internal team first.

Thanks.

Thank you. I will wait for your reply.

Hi,

Thanks for your patience.
We are still working on this issue and will keep you updated.

This limitation comes from the memory allocation strategy of DLA, which only allows 4 runtime instances.
We are working on a different approach to allocation.
Will let you know once it is ready.

Thanks.