How does TRT inference run on both the DLAs and the GPU?

Hi, please refer to the MLCommons benchmark link below. In the AGX Orin row, it is mentioned that both the GPU and the two DLAs are used for TRT inference.

So I wanted to know how the inference is run in parallel on both the GPU and the DLA.
Also, how can both DLAs be utilized at the same time during inference?

MLCommons link : v3.0 Results | MLCommons

Also, in the corresponding GitHub repo I couldn't find any lines of code that indicate the above behaviour of using both DLAs concurrently or running inference across multiple devices in parallel.

ResNet50 TRT inference code for reference: https://github.com/mlcommons/inference_results_v3.0/blob/main/closed/NVIDIA/code/resnet50/tensorrt/ResNet50.py

Hi,

The two DLAs and the GPU can run concurrently.
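This is not the MLPerf harness implementation, but a minimal sketch of the general pattern using the TensorRT 8.x-style Python API (as shipped on JetPack for Orin): build one engine per device (DLA core 0, DLA core 1, and the GPU) and give each its own execution context and CUDA stream, so the three enqueue calls overlap. The ONNX file name, static input shapes, and the pycuda buffer handling are assumptions for illustration only.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, device_type, dla_core=-1):
    """Build a serialized TensorRT engine targeted at the GPU or at one DLA core."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)              # DLA requires FP16 or INT8
    if device_type == trt.DeviceType.DLA:
        config.default_device_type = trt.DeviceType.DLA
        config.DLA_core = dla_core                      # 0 or 1 on AGX Orin
        config.set_flag(trt.BuilderFlag.GPU_FALLBACK)   # unsupported layers fall back to GPU
    return builder.build_serialized_network(network, config)

# One engine per device: DLA core 0, DLA core 1, and the GPU ("resnet50.onnx" is a placeholder).
plans = [
    build_engine("resnet50.onnx", trt.DeviceType.DLA, dla_core=0),
    build_engine("resnet50.onnx", trt.DeviceType.DLA, dla_core=1),
    build_engine("resnet50.onnx", trt.DeviceType.GPU),
]

runtime = trt.Runtime(TRT_LOGGER)
workers = []
for plan in plans:
    engine = runtime.deserialize_cuda_engine(plan)
    context = engine.create_execution_context()
    stream = cuda.Stream()
    # One device buffer per binding (input + output); assumes static shapes.
    buffers = [
        cuda.mem_alloc(
            trt.volume(engine.get_binding_shape(i))
            * np.dtype(trt.nptype(engine.get_binding_dtype(i))).itemsize)
        for i in range(engine.num_bindings)
    ]
    workers.append((context, stream, buffers))

# Enqueue all three back to back; each runs on its own stream,
# so the two DLA cores and the GPU execute in parallel.
for context, stream, buffers in workers:
    context.execute_async_v2([int(b) for b in buffers], stream.handle)
for _, stream, _ in workers:
    stream.synchronize()
```

Building a separate plan per target is the simplest way to pin each execution context to its device; input/output copies and batching logic are omitted here for brevity.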

However, since the DLA is a hardware accelerator with limited functionality, some models cannot run entirely on the DLA.
If a model requires GPU fallback frequently, the data transfers between the GPU and the DLA can decrease performance.
In such a case, the throughput of DLAs+GPU might not be higher than GPU-only mode.
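To estimate how much of a given network actually lands on the DLA (and therefore how much fallback traffic to expect), the builder config can be queried per layer. A small sketch, assuming `network` and `config` objects created with the DLA device type set, as in the `build_engine` sketch above:

```python
# Count layers the DLA can and cannot accept; frequent fallback means
# extra GPU<->DLA transfers and usually lower DLAs+GPU throughput.
dla_ok = sum(config.can_run_on_DLA(network.get_layer(i))
             for i in range(network.num_layers))
print(f"{dla_ok}/{network.num_layers} layers can run on the DLA")
```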

We have another benchmark repo that can give you some idea about this:
devices=3 indicates DLAs+GPU, and devices=1 means GPU-only.

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.