What other limitations are there on the number of TensorRT contexts running concurrently on the DLA?

According to the official documentation, each DLA on the Jetson AGX Orin can theoretically run 16 TensorRT contexts concurrently. In practice, however, I can only run 10 TensorRT contexts per DLA concurrently.

The error occurs on the 11th call to createExecutionContext(). Here is the corresponding output from the verbose log:

Total per-runner device persistent memory is 0
Total per-runner host persistent memory is 96
Allocated activation device memory of size 630784
1: [cudlaUtils.cpp::LoadableManager::48] Error Code 1: DLA (Failed to deserialize DLA loadable)
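For reference, all of the contexts are created from a single deserialized engine, roughly along the lines of the sketch below (a simplified sketch only, not my actual program; engineBlob and kNumContexts are placeholders):

#include <NvInfer.h>
#include <iostream>
#include <vector>

// Minimal logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cerr << msg << std::endl;
    }
} gLogger;

// Sketch: create many execution contexts from one DLA engine.
// engineBlob (the serialized plan) and kNumContexts are placeholders.
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
runtime->setDLACore(0);  // target DLA core 0

nvinfer1::ICudaEngine* engine =
    runtime->deserializeCudaEngine(engineBlob.data(), engineBlob.size());

std::vector<nvinfer1::IExecutionContext*> contexts;
for (int i = 0; i < kNumContexts; ++i) {
    nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();
    if (ctx == nullptr) {  // in my case this fails at the 11th context
        std::cerr << "createExecutionContext() failed for context " << i << std::endl;
        break;
    }
    contexts.push_back(ctx);
}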

So I’d like to ask: what other limitations are there on the number of TensorRT contexts running concurrently on the DLA?

Hi,
Here are some suggestions for the common issues:

1. Performance

Please run the commands below before benchmarking a deep learning use case:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

2. Installation

Installation guide of deep learning frameworks on Jetson:

3. Tutorial

Startup deep learning tutorial:

4. Report issue

If these suggestions don’t help and you want to report an issue to us, please share the model, the command/steps, and the customized app (if any) with us so we can reproduce it locally.

Thanks!

Hi,

To be more precise, only 16 DLA loadables can be loaded concurrently per core.

However, TensorRT might split a model into multiple DLA loadables.
So the actual number of TensorRT engines that can run concurrently on the DLA depends on the model architecture.
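For example, the layer placement (and any split into several loadables, printed as ForeignNode blocks) shows up in the build log when the engine is built for DLA, e.g. with a configuration similar to the sketch below (builder and network are assumed to exist already; this is only an illustration, not your exact build code):

// Sketch: build an engine targeted at DLA so the build log reports
// which layers run on DLA vs. GPU and how many loadables are created.
// "builder" and "network" are assumed to exist already.
nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
config->setDLACore(0);                                  // build for DLA core 0
config->setFlag(nvinfer1::BuilderFlag::kFP16);          // DLA requires FP16 or INT8
config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);  // unsupported layers fall back to GPU
nvinfer1::IHostMemory* plan = builder->buildSerializedNetwork(*network, *config);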

Thanks.

Does this mean that my 10 TensorRT contexts are split into 16 DLA loadables, reaching the theoretical cap of 16?

Hi,

It is possible.

To learn more about the TensorRT behavior, please share the conversion log generated with --verbose.
It contains the details of how TensorRT places the inference tasks.

Thanks.

Here is my point: my TensorRT contexts are all created from the same engine, so in theory each context should correspond to the same number of DLA loadables (I am not sure whether this assumption is correct). Running 10 TensorRT contexts concurrently produces no error. Since the upper limit of DLA loadables running concurrently on each DLA is 16, if each TensorRT context corresponded to 2 or more DLA loadables, then 10 contexts would mean 20 or more loadables running concurrently, which exceeds the limit and should already report an error. Because no error actually occurs, each TensorRT context should correspond to only one DLA loadable. But then running 11 TensorRT contexts concurrently should correspond to only 11 DLA loadables, which does not reach the upper limit of 16, yet the 11th context still fails.

Attached is my verbose log. My program builds TensorRT engines for each of the 4 ONNX models, with two of the engines running on two different DLA cores.
build_plan.log (1006.8 KB)

Hi,

Based on your log, there is no layer running on the DLA.
Please check first whether you converted the model as expected.

---------- Layers Running on DLA ----------
---------- Layers Running on GPU ----------
[GpuLayer] SCALE: resnetv22_batchnorm0_fwd
[GpuLayer] CONVOLUTION: resnetv22_conv0_fwd + resnetv22_batchnorm1_fwd + resnetv22_relu0_fwd
[GpuLayer] POOLING: resnetv22_pool0_fwd
[GpuLayer] SCALE: resnetv22_stage1_batchnorm0_fwd + resnetv22_stage1_activation0
[GpuLayer] CONVOLUTION: resnetv22_stage1_conv0_fwd + resnetv22_stage1_batchnorm1_fwd + resnetv22_stage1_activation1
[GpuLayer] CONVOLUTION: resnetv22_stage1_conv1_fwd + resnetv22_stage1__plus0
[GpuLayer] SCALE: resnetv22_stage1_batchnorm2_fwd + resnetv22_stage1_activation2
[GpuLayer] CONVOLUTION: resnetv22_stage1_conv2_fwd + resnetv22_stage1_batchnorm3_fwd + resnetv22_stage1_activation3
[GpuLayer] CONVOLUTION: resnetv22_stage1_conv3_fwd + resnetv22_stage1__plus1

Thanks.

Sorry about that. This attachment is the verbose log of my two DLA models.
DLA_build_log.log (54.9 KB)

Hi,

The model only has a single DLA loadable:

---------- Layers Running on DLA ----------
[DlaLayer] {ForeignNode[resnetv22_stage2_batchnorm0_fwd...resnetv22_stage2__plus1]}
---------- Layers Running on GPU ----------

Would you mind also checking the RAM usage?
Could you check the overall Managed SRAM / Local DRAM / Global DRAM to see if there are still resources remaining for the 11th loadable?

Thanks.

Sorry, how do I check the RAM usage? I don’t know how to see whether there are still resources remaining.

Hi,

Sorry for the late update.

You can find this info in the TensorRT log as well.
For example:

Memory consumption details:
	Pool Sizes: Managed SRAM = 0.5 MiB,	Local DRAM = 1024 MiB,	Global DRAM = 512 MiB
	Required: Managed SRAM = 0.5 MiB,	Local DRAM = 4 MiB,	Global DRAM = 4 MiB
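If needed, these pool sizes can also be adjusted at build time through the builder configuration, for example as in the sketch below (config is assumed to be the IBuilderConfig used to build the DLA engine, and the sizes are examples only):

// Sketch: set the DLA memory pool limits when building the engine.
// The sizes below are illustrative, not recommended values.
config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_MANAGED_SRAM, 1ULL << 19);  // 0.5 MiB
config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_LOCAL_DRAM,   1ULL << 30);  // 1 GiB
config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_GLOBAL_DRAM,  1ULL << 29);  // 512 MiB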

Thanks.
