*** Please note that this reply is generated by LLM automatically ***
I’d be happy to help you with your questions about DLA model inference on the Orin platform.
Issue 1: Increased inference latency of DLA model when running another program on the GPU
The increased inference latency of the DLA model when another program runs on the GPU is most likely due to resources shared between the DLA and the GPU. Although the DLA is a separate, fixed-function accelerator, it shares DRAM bandwidth and the system interconnect with the GPU, and it spills intermediate tensors to DRAM when its local SRAM is not large enough.
When another program is running on the GPU, it may be consuming some of these shared resources, leading to increased contention and latency for the DLA model. This is especially true if the GPU program is memory-intensive or has high system bus utilization.
To mitigate this issue, try to reduce the GPU program's memory traffic, and lock the clocks (for example with sudo jetson_clocks) so that dynamic frequency scaling does not add run-to-run variance. The --useDLACore flag of trtexec selects which of Orin's two DLA cores runs the engine; this prevents two DLA workloads from contending for the same core, but it does not remove memory-bandwidth contention with the GPU.
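As a starting point, a trtexec invocation along these lines builds and profiles an engine pinned to DLA core 0 (model.onnx is a placeholder for your network; this is a sketch to adapt, not a complete tuning recipe):

```shell
# Build an engine on DLA core 0; layers the DLA cannot run fall back
# to the GPU. FP16 is used because the DLA does not execute FP32.
trtexec --onnx=model.onnx \
        --useDLACore=0 \
        --allowGPUFallback \
        --fp16 \
        --dumpProfile
```

The per-layer profile from --dumpProfile shows which layers landed on the DLA and which fell back, which helps attribute latency changes to GPU contention versus fallback overhead.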
Issue 2: TensorRT partitioning the CNN model into multiple subgraphs when targeting DLA
TensorRT partitions the CNN model into multiple subgraphs when targeting DLA because of the limitations of the DLA hardware. The DLA is a fixed-function engine that is optimized for specific types of neural network layers, such as convolution, deconvolution, activation, pooling, local response normalization, and fully connected layers.
When a CNN model is compiled for the DLA, TensorRT analyzes the model and partitions it into subgraphs that can be executed on the DLA. This partitioning is done to ensure that each subgraph can be executed efficiently on the DLA, taking into account the DLA’s limitations and optimizations.
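For reference, targeting the DLA at build time through TensorRT's Python API looks roughly like the following configuration sketch (it assumes a Jetson with TensorRT's Python bindings installed; network parsing and engine building are elided):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Place layers on the DLA by default and pick one of Orin's two cores.
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0

# The DLA runs FP16/INT8 only, and unsupported layers need GPU fallback.
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
```

With GPU_FALLBACK set, the builder performs the subgraph partitioning described below automatically; without it, building fails if any layer cannot run on the DLA.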
There are several reasons why TensorRT may partition a CNN model into multiple subgraphs when targeting DLA:
- Layer limitations: The DLA supports only certain layer types and parameter ranges; for example, some activation functions and normalization layers are not supported. When a model contains unsupported layers, TensorRT keeps the supported runs on the DLA and falls the unsupported layers back to the GPU (this requires GPU fallback to be enabled, e.g. --allowGPUFallback with trtexec), and each switch between DLA and GPU starts a new subgraph.
- Memory constraints: The DLA has limited memory resources, and large CNN models may not fit within these resources. TensorRT may partition the model into subgraphs that can be executed on the DLA within the available memory resources.
- Optimization: TensorRT may partition the model into subgraphs to optimize the execution of the model on the DLA. For example, TensorRT may group together layers that can be executed in parallel on the DLA, or it may split apart layers that have different execution requirements.
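The net effect of these rules can be illustrated with a toy model of the partitioner (the names here are hypothetical, not TensorRT API): walk the layer sequence and group consecutive layers by whether the DLA supports them. A single unsupported layer in the middle of a network is enough to produce three subgraphs.

```python
# Toy illustration of DLA/GPU graph partitioning (not the TensorRT API):
# consecutive layers with the same placement merge into one subgraph.
DLA_SUPPORTED = {"conv", "deconv", "relu", "pool", "lrn", "fc"}

def partition(layers):
    """Group a layer sequence into (device, layers) subgraphs."""
    subgraphs = []
    for layer in layers:
        device = "DLA" if layer in DLA_SUPPORTED else "GPU"
        if subgraphs and subgraphs[-1][0] == device:
            subgraphs[-1][1].append(layer)
        else:
            subgraphs.append((device, [layer]))
    return subgraphs

# One unsupported layer ("slice") splits the model into three subgraphs.
print(partition(["conv", "relu", "slice", "conv", "pool"]))
```

Each DLA-to-GPU transition also implies a data transfer between the two engines, which is why minimizing the number of subgraphs matters for latency.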
To reduce the number of subgraphs, make the model more DLA-friendly: replace unsupported layers with supported equivalents where possible, keep tensor shapes and layer parameters within the DLA's documented limits, and build in FP16 or INT8, since the DLA does not run FP32.
Comparison with other platforms
The behavior you are observing on the Orin platform is similar to what you would see on other platforms with a DLA, such as Jetson AGX Xavier or Xavier NX (note that the original Jetson Nano has no DLA). However, the specific details of the partitioning and execution may vary with the platform, the TensorRT version, and the CNN model being used.
In general, the DLA is a powerful hardware accelerator that can provide significant performance benefits for CNN models. However, it does require careful optimization and tuning to achieve the best results.
I hope this helps! Let me know if you have any further questions or if there’s anything else I can help with.