Does DLA run faster than the GPU for an fp16 model?

Hello.

I am trying to migrate MobileNetV2 to AGX Xavier with DLA conversion.

I removed the end of the model because the TensorRT conversion failed on the GlobalAveragePool and Gemm layers.
Then I successfully converted MobileNetV2 to trt files, one for DLA only and one for GPU only, both in fp16 format.
However, the latency looks odd:
DLA fp16 : 6.25089 ms
GPU fp16 : 2.88255 ms

A DLA-only model should be faster than a GPU-only one, shouldn’t it?

Thanks.

Hi,

Please note that we don’t expect DLA to run faster than the GPU.
You can find more details below:

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-820-ea/developer-guide/index.html#troubleshooting

Q: Why does my network run slower when using DLA compared to without DLA?

A: DLA was designed to maximize energy efficiency. Depending on the features supported by DLA and the features supported by the GPU, either implementation can be more performant. Which implementation to use depends on your latency or throughput requirements and your power budget. Since all DLA engines are independent of the GPU and each other, you could also use both implementations at the same time to further increase the throughput of your network.

For the performance issue, would you mind sharing the log from a run with --dumpProfile?

$ /usr/src/tensorrt/bin/trtexec --dumpProfile ...
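
As a rough sketch of how the two profiling runs could be collected, the helper functions below wrap trtexec for a GPU-only run and a DLA run. The `TRTEXEC` override variable, the function names, and the `model.onnx` and log file names are assumptions for illustration; only `--onnx`, `--fp16`, `--useDLACore`, and `--dumpProfile` are actual trtexec options.

```shell
# Path to trtexec as used in this thread; TRTEXEC is a hypothetical override
# (for example TRTEXEC=echo prints the arguments instead of running anything).
TRTEXEC="${TRTEXEC:-/usr/src/tensorrt/bin/trtexec}"

# Build from an ONNX file in fp16 on the GPU and dump the per-layer profile.
profile_gpu() {
    "$TRTEXEC" --onnx="$1" --fp16 --dumpProfile
}

# Same, but on a DLA core (default core 0).
# If some layers are not supported on DLA, --allowGPUFallback may be needed.
profile_dla() {
    "$TRTEXEC" --onnx="$1" --fp16 --useDLACore="${2:-0}" --dumpProfile
}

# Example usage (file names are placeholders):
#   profile_gpu model.onnx   > GPU_fp16.log 2>&1
#   profile_dla model.onnx 0 > DLA_fp16.log 2>&1
```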

Thanks.

dumpProfile.log (16.3 KB)
I attached the dumpProfile file

Is DLA mainly for energy efficiency, at the cost of latency?
But I have read a document saying that running on DLA is faster than on the GPU (unfortunately I can’t find it again, OTL).

Thanks.

Hi,

DLA inference might be slower if some layers fall back to the GPU.
But in general, it should give you performance similar to the GPU.

Would you mind attaching the profiling data of GPU mode as well?
This will help us to compare the performance at the layer level.

Also, in case you are not aware of this:
you should maximize the device clocks with the following commands before benchmarking:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

Hello,

I attached the following files.
Could you please look into them?
GPU_fp16.log : GPU only mode
DLA_fp16.log : DLA only mode

Thanks.

GPU_fp16.log (28.8 KB)
DLA_fp16.log (14.2 KB)

Hi,

Thanks for sharing.

DLA is much slower than the GPU at the layer level as well.
Is it possible to share the model with us so we can check it further?

Thanks.

Hello,

I attached the engine files for both modes.

Thanks
resnet50_sim_mod_DLA_fp16.trt (57.2 MB)
resnet50_sim_mod_GPU_fp16.trt (47.2 MB)

Hi,

Sorry for the unclear statement.

Would you mind sharing the original ONNX model with us?
Since the TensorRT engine is not portable, this will help us test on different platforms and software versions.

Thanks.

Hello,

I attached the ONNX model file.
resnet50_sim_mod.onnx (89.6 MB)

Thanks.

Hi,

Thanks for sharing the model.

We confirmed that we can reproduce the same performance issue internally.
We are checking this with our internal team and will share more information with you later.

Hello,

Meanwhile, I also took some measurements with the following command:
$ time runTest.sh & runTest.sh & runTest.sh & runTest.sh &
This runs 4 processes simultaneously.
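
One caveat about the command above: in the shell, `time` applies only to the first pipeline, so it reports the duration of the first `runTest.sh` rather than of all four. A minimal sketch of one way to time the whole batch is shown below; `run_parallel` is a hypothetical helper name, and `date +%s%N` assumes GNU date, as found on the Jetson's Ubuntu image.

```shell
# Run N copies of a command concurrently and print the wall-clock time
# (in milliseconds) until ALL of them have finished.
run_parallel() {
    local n="$1"
    shift
    local start end i
    start=$(date +%s%N)           # nanoseconds since the epoch (GNU date)
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >&2 &                # command output goes to stderr, so stdout carries only the timing
        i=$((i + 1))
    done
    wait                          # block until every background job has exited
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Example usage with the script from this thread:
#   run_parallel 4 runTest.sh
```

Because each process also pays its own startup cost, the total mixes inference time with initialization, so it is only a coarse comparison.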


The results suggest the following:

  1. The execution time of each single DLA unit is much longer than the GPU’s.
  2. The total execution time across the DLA units is less than the GPU’s.

Could you explain why?
I would expect the total execution time of the DLA units to also be much longer than the GPU’s, wouldn’t it?

Thanks.

Hi,

We got some feedback from our internal team.

For Xavier, the GPU is 20 TOPS while the DLA is 5 TOPS.
So purely from the TOPS point of view, the GPU is 4x faster than the DLA.
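
For reference, the peak ratio can be set against the latencies reported in the first post of this thread. The small helper below (`ratio` is a hypothetical name) does the arithmetic: the measured gap is about 2.2x, smaller than the 4x peak figure.

```shell
# Print a / b rounded to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 6.25089 2.88255   # measured latency, DLA / GPU  -> 2.17
ratio 20 5              # peak TOPS, GPU / DLA         -> 4.00
```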

Thanks.

Hello,

Could you please reply to my question above about multi-process performance?

Thanks.

Hi,

There are two DLA hardware units but only one GPU on Xavier.
So running jobs in parallel benefits DLA, since two jobs can run on different DLAs without sharing resources.

Thanks.

Hello,

Do you mean two DLAs work simultaneously even if I issue the command explicitly with the “--useDLACore=0” option?

Thanks.

Hi,

If you have specified the DLA index, TensorRT should run the tasks on DLA-0.
Could you help to confirm by checking the DLA status?

$ watch -n 1 "cat /sys/devices/platform/host1x/15880000.nvdla0/power/runtime_status"
$ watch -n 1 "cat /sys/devices/platform/host1x/158c0000.nvdla1/power/runtime_status"

Thanks.

Hello,

I just confirmed that the DLAs work, but only one at a time.
--useDLACore=0 : DLA-0 works
--useDLACore=1 : DLA-1 works

Could you tell me how to activate two DLAs at the same time in order to improve the latency?

Thanks

Hi,

Have you tried launching two TensorRT samples at the same time?
That way you can deploy each one on a different DLA.

Note that this can only increase throughput, not reduce latency.
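
A minimal sketch of that suggestion, assuming trtexec is used as the sample: one process is started per DLA core and the script waits for both. The `TRTEXEC` override variable, the function name, and the `dla0.log`/`dla1.log` file names are assumptions for illustration, and building from ONNX in each process is a simplification.

```shell
# Path to trtexec as used in this thread; TRTEXEC is a hypothetical override
# (for example TRTEXEC=echo for a dry run that only records the arguments).
TRTEXEC="${TRTEXEC:-/usr/src/tensorrt/bin/trtexec}"

# Start one trtexec process per DLA core (0 and 1) in the background,
# each writing to its own log file, then wait for both to finish.
run_on_both_dlas() {
    local model="$1" core
    for core in 0 1; do
        "$TRTEXEC" --onnx="$model" --fp16 --useDLACore="$core" \
            > "dla${core}.log" 2>&1 &
    done
    wait
}

# Example usage (model file name is a placeholder):
#   run_on_both_dlas model.onnx
```

Each process still sees the single-DLA latency; the gain is that two inferences are in flight at once.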

Thanks.