Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.10.0
DRIVE OS 6.0.9.0
DRIVE OS 6.0.8.1
DRIVE OS 6.0.6
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other
Target Operating System
Linux
QNX
other
Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-300)
DRIVE AGX Orin Developer Kit (940-63710-0010-200)
DRIVE AGX Orin Developer Kit (940-63710-0010-100)
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure of its number)
other
SDK Manager Version
2.1.0
other
Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other
Issue Description
Q: Why would a model be split into multiple DLA subgraphs when GPU fallback is not a consideration? Is it caused by problematic dependencies between nodes? Sorry, but I cannot share the specific model yet. I could try to put together a reproducible example later, but the engine builder seems very sensitive to small changes in the model, so I don't know whether that's feasible. Some general guidance on how the subgraphs are chosen would already help. Am I also correct in assuming this partitioning is decided before the timing phase?
To be clear, the model builds in this state.
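For context, the engine is built with a trtexec invocation along these lines. This is only a sketch: the model path, engine path, and pool sizes are placeholders (the pool sizes just mirror what the verbose log reports and may simply be the defaults). The relevant point is that the whole network targets DLA standalone and --allowGPUFallback is deliberately not passed:

trtexec --onnx=model.onnx \
        --useDLACore=0 \
        --int8 \
        --memPoolSize=dlaSRAM:1,dlaLocalDRAM:1024,dlaGlobalDRAM:512 \
        --saveEngine=model.dla.engine \
        --verbose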
Logs
[06/26/2025-20:38:43] [V] [TRT] ---------- Layers Running on GPU ----------
[06/26/2025-20:38:43] [V] [TRT] No layer is running on GPU
[06/26/2025-20:38:43] [V] [TRT] After Myelin optimization: 327 layers
[06/26/2025-20:38:45] [V] [TRT] DLA Memory Pool Sizes: Managed SRAM = 1 MiB, Local DRAM = 1024 MiB, Global DRAM = 512 MiB
[06/26/2025-20:38:51] [V] [TRT] {ForeignNode[…/model/Sigmoid_4]} successfully offloaded to DLA.
Memory consumption: Managed SRAM = 1 MiB, Local DRAM = 64 MiB, Global DRAM = 32 MiB
[06/26/2025-20:38:51] [V] [TRT] {ForeignNode[/model/MaxPool_3…/model/…/Conv]} successfully offloaded to DLA.
Memory consumption: Managed SRAM = 1 MiB, Local DRAM = 8 MiB, Global DRAM = 8 MiB
[06/26/2025-20:38:51] [V] [TRT] {ForeignNode[/…/model/Concat_12]} successfully offloaded to DLA.
Memory consumption: Managed SRAM = 1 MiB, Local DRAM = 16 MiB, Global DRAM = 8 MiB
[06/26/2025-20:38:51] [V] [TRT] {ForeignNode[/model/Slice_8…/model/MaxPool_7]} successfully offloaded to DLA.
Memory consumption: Managed SRAM = 1 MiB, Local DRAM = 4 MiB, Global DRAM = 4 MiB
[06/26/2025-20:38:51] [V] [TRT] {ForeignNode[/model/Slice_1…/model/Sigmoid_7]} successfully offloaded to DLA.
Memory consumption: Managed SRAM = 1 MiB, Local DRAM = 32 MiB, Global DRAM = 4 MiB
[06/26/2025-20:38:51] [V] [TRT] DLA Memory Consumption Summary:
[06/26/2025-20:38:51] [V] [TRT] Number of DLA node candidates offloaded : 5 out of 5
I have a model that ends up split across 5 subgraphs, while a very similar model fits in a single subgraph; the difference is essentially four extra MaxPool nodes.
I also believe the output format is restricted to CHW32 by the tactics chosen, which is inconvenient for the post-processing. I'm not sure whether this is related to the multiple subgraphs; if it's unrelated, the multiple subgraphs may be acceptable. That brings me to the second question:
Q: When setting the outputIOFormats option to int8:dla_linear, why does TensorRT still pick the CHW32 output format and then fail on a reformat, instead of picking tactics that do not require a reformat?
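Concretely, by "setting the outputIOFormats" I mean a trtexec invocation roughly like this (placeholder paths again; trtexec expects one type:format spec per output tensor in network order, shown here as a single spec for brevity):

trtexec --onnx=model.onnx \
        --useDLACore=0 \
        --int8 \
        --outputIOFormats=int8:dla_linear \
        --saveEngine=model.dla.engine \
        --verbose

My expectation was that constraining the outputs to dla_linear would steer tactic selection toward formats DLA can produce directly, but the build still picks CHW32 internally and then fails on the reformat shown below.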
Error
[06/26/2025-22:18:46] [V] [TRT] Adding reformat layer: Reformatted Input Tensor 1 to {ForeignNode[/model/MaxPool_3…/model/…/Conv]} (/model/…/…1/…/Mul_output_0) from Int8(14336,7168:32,112,1) to Int8(393216,8192,128,1)
[06/26/2025-22:18:46] [E] Error[9]: [standardEngineBuilder.cpp::isValidDLAConfig::1821] Error Code 9: Internal Error (Default DLA is enabled but layer Reformatting CopyNode for Input Tensor 1 to {ForeignNode[/model/MaxPool_3…/model/…/Conv]} is not supported on DLA and falling back to GPU is not enabled.)