Lower throughput when enabling DLA on Jetson AGX Orin

Hi,

We downloaded pretrained ResNet50 weights from Keras and converted the model to ONNX with multiple batches. After that conversion, we converted the ONNX model to a TensorRT engine using the command below:

/usr/src/tensorrt/bin/trtexec --onnx=onnx_model.onnx --saveEngine=resnet50.trt --explicitBatch --inputIOFormats=int8:chw --outputIOFormats=int8:chw --int8 --useDLACore=0 --allowGPUFallback=True --sparsity=disable --verbose=True

We prepared two models: one running on the GPU and one running on the DLA.
After running inference, we collected the results below.

From the above table, the GPU results are quite acceptable, but the DLA results show very low throughput. We have seen the same pattern with other TensorFlow models (MobileNet, SSD-MobileNet, VGG, etc.), and we would like to know why the throughput is so low.

Can you please suggest why we are observing lower throughput with the DLA?

Hi,

We are moving this post to the Jetson AGX Orin forum to get better help.

Thank you.

I have the same issue with a different model. It is the exact same model in both runs; the only difference is whether or not I added --useDLACore=0 --allowGPUFallback=True.

In FP16, using the DLA is ~10x slower in my case.
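
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch that builds the two trtexec command lines so they differ only in the DLA flags. The helper name build_cmd and the file names model.onnx, model_gpu.trt and model_dla.trt are placeholders of mine, not anything from the posts above. The trtexec path is the one used earlier in this thread, and --allowGPUFallback is written as a bare switch, which is how the trtexec help lists it. The script only prints the commands; it does not run them.

```python
import shlex

# Path taken from the command quoted earlier in this thread.
TRTEXEC = "/usr/src/tensorrt/bin/trtexec"


def build_cmd(onnx_path, engine_path, precision_flag="--fp16", dla_core=None):
    """Return a trtexec argument list (hypothetical helper).

    dla_core=None builds for the GPU only; an integer adds the DLA flags.
    """
    cmd = [
        TRTEXEC,
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        precision_flag,
    ]
    if dla_core is not None:
        # These two flags are the only difference between the two builds.
        cmd += [f"--useDLACore={dla_core}", "--allowGPUFallback"]
    return cmd


# Placeholder file names, for illustration only.
gpu_cmd = build_cmd("model.onnx", "model_gpu.trt")
dla_cmd = build_cmd("model.onnx", "model_dla.trt", dla_core=0)

# Collect the flags that exist only in the DLA build.
dla_only_flags = [
    arg for arg in dla_cmd
    if arg.startswith(("--useDLACore", "--allowGPUFallback"))
]

print(shlex.join(gpu_cmd))
print(shlex.join(dla_cmd))
print("DLA-only flags:", dla_only_flags)
```

Keeping everything else identical (same ONNX file, same precision flag) makes it easier to attribute any throughput gap to the DLA placement itself rather than to a difference in build options.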

Hi,

Do you want to compare the performance between GPU and DLA?

Please see the troubleshooting section of our documentation below:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#troubleshooting

Q: Why does my network run slower when using DLA compared to without DLA?

A: DLA was designed to maximize energy efficiency. Depending on the features supported by DLA and the features supported by the GPU, either implementation can be more performant. Which implementation to use depends on your latency or throughput requirements and your power budget. Since all DLA engines are independent of the GPU and each other, you could also use both implementations at the same time to further increase the throughput of your network.
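
To make the last point concrete, here is a rough sketch of the arithmetic behind running the GPU and the DLA engines at the same time. The throughput numbers are hypothetical and are not measurements from this thread. The sketch assumes one GPU engine plus two DLA engines (Jetson AGX Orin has two DLA cores) and treats the sum as an upper bound: in practice the engines share memory bandwidth, and any layers that fall back to the GPU also compete with the GPU engine, so the real gain will likely be smaller.

```python
def aggregate_throughput(per_engine_qps):
    """Upper bound on combined throughput, assuming the engines do not contend."""
    return sum(per_engine_qps.values())


# Hypothetical per-engine throughput in inferences per second.
# Illustration only; replace with your own trtexec measurements.
measured = {"gpu": 1000.0, "dla0": 100.0, "dla1": 100.0}

total = aggregate_throughput(measured)
gain = total / measured["gpu"] - 1.0  # relative gain over the GPU alone

print(f"GPU alone:   {measured['gpu']:.0f} inferences/s")
print(f"All engines: {total:.0f} inferences/s (upper bound)")
print(f"Best-case gain over GPU alone: {gain:.0%}")
```

So even when a single DLA engine is much slower than the GPU, it can still add throughput on top of the GPU rather than replace it, and it frees GPU time for other work.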

Thanks.