According to the official documentation, each DLA on the Jetson AGX Orin can theoretically run 16 TensorRT contexts concurrently. In practice, however, I can only run 10 TensorRT contexts per DLA at the same time.
The error is reported on the 11th call to createExecutionContext(). Here is the corresponding output from the verbose log:
Total per-runner device persistent memory is 0
Total per-runner host persistent memory is 96
Allocated activation device memory of size 630784
1: [cudlaUtils.cpp::LoadableManager::48] Error Code 1: DLA (Failed to deserialize DLA loadable)
So I’d like to ask: what other limitations are there on the number of TensorRT contexts running concurrently on the DLA?
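For reference, here is a simplified sketch of what my program does (not the exact code; the engine file name, the DLA core index, and the loop bound of 16 are placeholders):

```cpp
#include <NvInfer.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Minimal logger that prints warnings and errors.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} gLogger;

int main()
{
    // Load a serialized engine that was built for DLA.
    std::ifstream file("dla_engine.plan", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    runtime->setDLACore(0); // target DLA core 0
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size());
    if (!engine)
        return 1;

    // Create contexts from the same engine until creation fails.
    std::vector<nvinfer1::IExecutionContext*> contexts;
    for (int i = 0; i < 16; ++i)
    {
        nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();
        if (!ctx)
        {
            std::cout << "Context creation failed at #" << (i + 1) << std::endl;
            break;
        }
        contexts.push_back(ctx);
    }
    std::cout << "Created " << contexts.size() << " contexts" << std::endl;
    // (cleanup omitted in this sketch)
    return 0;
}
```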
If these suggestions don’t help and you want to report an issue to us, please share the model, the command/steps, and the customized app (if any) with us so we can reproduce it locally.
To learn more about the TensorRT behavior, please share the conversion log generated with --verbose.
It contains the details of how TensorRT places the inference tasks.
Here is my point: my TensorRT contexts are all created from the same engine, so in theory each context should correspond to the same number of DLA loadables (I’m not sure whether this assumption is correct). Since the upper limit of DLA loadables running concurrently on each DLA is 16, if each TensorRT context corresponded to 2 or more loadables, then 10 contexts would already mean 20 or more concurrently running loadables, which exceeds the limit and should produce an error. But 10 contexts run without error, so each context must correspond to only one DLA loadable. In that case, 11 concurrent contexts should correspond to only 11 loadables, which is still below the limit of 16, yet the 11th context fails.
Attached is my verbose log. My program builds a TensorRT engine for each of the 4 ONNX models, with two of the engines running on two different DLA cores. build_plan.log (1006.8 KB)
---------- Layers Running on DLA ----------
[DlaLayer] {ForeignNode[resnetv22_stage2_batchnorm0_fwd...resnetv22_stage2__plus1]}
---------- Layers Running on GPU ----------
Would you mind also checking the RAM usage?
Could you check the overall Managed SRAM / Local DRAM / Global DRAM to see if there are still resources remaining for the 11th loadable?
You can find this info in the TensorRT log as well.
For example:
Memory consumption details:
Pool Sizes: Managed SRAM = 0.5 MiB, Local DRAM = 1024 MiB, Global DRAM = 512 MiB
Required: Managed SRAM = 0.5 MiB, Local DRAM = 4 MiB, Global DRAM = 4 MiB
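If you need to adjust these pools, they can be set on the builder config at engine build time. Below is a hedged sketch (assumes the TensorRT 8.4+ `setMemoryPoolLimit` API; the pool values simply mirror the example log above and should be tuned to your model):

```cpp
#include <NvInfer.h>

// Configure DLA placement and memory pool limits on an existing builder config.
// "config" is assumed to be a valid nvinfer1::IBuilderConfig*.
void configureDlaPools(nvinfer1::IBuilderConfig* config)
{
    using nvinfer1::MemoryPoolType;

    // Route layers to DLA core 0, allowing GPU fallback for unsupported layers.
    config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
    config->setDLACore(0);
    config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);

    // Pool sizes matching the example log above (0.5 MiB / 1024 MiB / 512 MiB).
    config->setMemoryPoolLimit(MemoryPoolType::kDLA_MANAGED_SRAM, 512u << 10);
    config->setMemoryPoolLimit(MemoryPoolType::kDLA_LOCAL_DRAM, 1024ull << 20);
    config->setMemoryPoolLimit(MemoryPoolType::kDLA_GLOBAL_DRAM, 512ull << 20);
}
```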