Why does a model split into subgraphs for DLA when GPU fallback is not enabled? Does this affect the required output formats?

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.10.0
DRIVE OS 6.0.9.0
DRIVE OS 6.0.8.1
DRIVE OS 6.0.6
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other

Target Operating System
Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-300)
DRIVE AGX Orin Developer Kit (940-63710-0010-200)
DRIVE AGX Orin Developer Kit (940-63710-0010-100)
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure of its number)
other

SDK Manager Version
2.1.0
other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other

Issue Description
Q: Why would a model be split into multiple subgraphs when GPU fallback is not a consideration? Is it due to problematic dependencies between nodes? Sorry, but I cannot share the specific model yet. I may try to build a reproducible example later, but the engine builder seems very sensitive to small changes in the model, so I'm not sure that's feasible. Some general guidance on how subgraphs are chosen would help. Am I also correct in assuming this is already decided before the timing phase?

To be clear, the model builds in this state.
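For context, the build configuration corresponds roughly to the following TensorRT Python API sketch (the actual build was done with trtexec; the model/engine paths are placeholders and INT8 calibration/dynamic ranges are omitted):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)            # calibrator / dynamic ranges omitted here
config.default_device_type = trt.DeviceType.DLA  # run everything on DLA
config.DLA_core = 0
# Optional: mirrors the managed SRAM pool size reported in the log below.
config.set_memory_pool_limit(trt.MemoryPoolType.DLA_MANAGED_SRAM, 1 << 20)
# GPU fallback is deliberately NOT enabled, i.e. no
# config.set_flag(trt.BuilderFlag.GPU_FALLBACK), so any layer or reformat
# that DLA cannot handle fails the build instead of moving to the GPU.

serialized = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized)
```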

Logs

[06/26/2025-20:38:43] [V] [TRT] ---------- Layers Running on GPU ----------
[06/26/2025-20:38:43] [V] [TRT] No layer is running on GPU
[06/26/2025-20:38:43] [V] [TRT] After Myelin optimization: 327 layers
[06/26/2025-20:38:45] [V] [TRT] DLA Memory Pool Sizes: Managed SRAM = 1 MiB, Local DRAM = 1024 MiB, Global DRAM = 512 MiB
[06/26/2025-20:38:51] [V] [TRT] {ForeignNode[…/model/Sigmoid_4]} successfully offloaded to DLA.
Memory consumption: Managed SRAM = 1 MiB, Local DRAM = 64 MiB, Global DRAM = 32 MiB
[06/26/2025-20:38:51] [V] [TRT] {ForeignNode[/model/MaxPool_3…/model/…/Conv]} successfully offloaded to DLA.
Memory consumption: Managed SRAM = 1 MiB, Local DRAM = 8 MiB, Global DRAM = 8 MiB
[06/26/2025-20:38:51] [V] [TRT] {ForeignNode[/…/model/Concat_12]} successfully offloaded to DLA.
Memory consumption: Managed SRAM = 1 MiB, Local DRAM = 16 MiB, Global DRAM = 8 MiB
[06/26/2025-20:38:51] [V] [TRT] {ForeignNode[/model/Slice_8…/model/MaxPool_7]} successfully offloaded to DLA.
Memory consumption: Managed SRAM = 1 MiB, Local DRAM = 4 MiB, Global DRAM = 4 MiB
[06/26/2025-20:38:51] [V] [TRT] {ForeignNode[/model/Slice_1…/model/Sigmoid_7]} successfully offloaded to DLA.
Memory consumption: Managed SRAM = 1 MiB, Local DRAM = 32 MiB, Global DRAM = 4 MiB
[06/26/2025-20:38:51] [V] [TRT] DLA Memory Consumption Summary:
[06/26/2025-20:38:51] [V] [TRT] Number of DLA node candidates offloaded : 5 out of 5

I have a model that is split into 5 subgraphs, while a very similar model fits into a single subgraph. The main difference is essentially four extra MaxPool nodes.

I also believe the output format is restricted to CHW32 by the tactics used, which is annoying for the post-processing. I'm not sure whether this is related to the multiple subgraphs; if it's unrelated, then maybe the multiple subgraphs are fine. That brings me to the second question:
Q: When setting the output IO format to int8:dla_linear, why does TensorRT still pick the CHW32 output format and then fail on a reformat, instead of picking tactics that don't require a reformat?
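What I am trying to get is the API-level equivalent of that trtexec IO-format request; continuing the builder sketch above, it would look roughly like this (output index and dynamic range are placeholders):

```python
# Continuing the builder sketch above, before build_serialized_network():
out = network.get_output(0)                                  # placeholder output index
out.dtype = trt.int8                                         # "int8" part of the request
out.allowed_formats = 1 << int(trt.TensorFormat.DLA_LINEAR)  # "dla_linear" part
out.dynamic_range = (-128.0, 127.0)                          # placeholder int8 range
```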

Error

[06/26/2025-22:18:46] [V] [TRT] Adding reformat layer: Reformatted Input Tensor 1 to {ForeignNode[/model/MaxPool_3…/model/…/Conv]} (/model/…/…1/…/Mul_output_0) from Int8(14336,7168:32,112,1) to Int8(393216,8192,128,1)

[06/26/2025-22:18:46] [E] Error[9]: [standardEngineBuilder.cpp::isValidDLAConfig::1821] Error Code 9: Internal Error (Default DLA is enabled but layer Reformatting CopyNode for Input Tensor 1 to {ForeignNode[/model/MaxPool_3…/model/…/Conv]} is not supported on DLA and falling back to GPU is not enabled.)

Dear @joseph.cleary,
Could you share the full log along with the trtexec command used?

I would like to, but there are some security concerns. I might be able to share it with some obfuscation if that would be helpful.

When it lists all layers, they are all DlaLayer, with none on the GPU. Then, as mentioned, it lists 5 subgraphs / DLA node candidates, which is what I see in the resulting engine.

I think the main part of the log before that is just the layer registration. If there is something in there that could be a hint, I can take a closer look, but I really don't see anything that explains why the model is split into subgraphs. Most of the information I can find about subgraphs/DLA layer support relates to GPU fallback, which I don't have enabled.
If no general information can be provided, I may have to come back with a smaller illustrative example, but figuring out why the builder makes certain decisions for DLA is very much trial and error at the moment.
I also still want to ask why the builder first chooses tactics with an arbitrary output format instead of choosing them based on the requested output IO format.

Could you check using --dumpProfile with trtexec? It would be great if you could provide a dummy model that reproduces the issue and share the logs so we can get more insight.
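For reference, the Python API analogue of --dumpProfile is to attach an IProfiler to the execution context; a minimal sketch, assuming an already-built engine file (the engine path is a placeholder, and buffer allocation plus the execute call itself are omitted):

```python
import tensorrt as trt

class LayerTimer(trt.IProfiler):
    """Accumulate the per-layer times TensorRT reports during execution."""
    def __init__(self):
        trt.IProfiler.__init__(self)
        self.times_ms = {}

    def report_layer_time(self, layer_name, ms):
        self.times_ms[layer_name] = self.times_ms.get(layer_name, 0.0) + ms

logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:  # placeholder engine path
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
context.profiler = LayerTimer()
# ... allocate device buffers and call context.execute_v2(bindings) as usual;
# after each run, context.profiler.times_ms holds the per-layer timings.
```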