Does DLA work faster than GPU in fp16 model?

insup.choi · May 16, 2022, 2:40am

Hello.

I am trying to migrate MobileNetV2 to AGX Xavier with DLA conversion.

I removed the end of the model because TensorRT reported conversion errors for the GlobalAveragePool and Gemm layers.
I then successfully converted MobileNetV2 to .trt files for DLA-only and GPU-only execution in fp16 format,
but the latencies look odd:
DLA fp16 : 6.25089 ms
GPU fp16 : 2.88255 ms

Shouldn’t a DLA-only model be faster than a GPU-only one?

Thanks.

AastaLLL · May 16, 2022, 4:04am

Hi,

Please note that we don’t expect DLA to run faster than the GPU.
You can find more details below:

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-820-ea/developer-guide/index.html#troubleshooting

Q: Why does my network run slower when using DLA compared to without DLA?

A: DLA was designed to maximize energy efficiency. Depending on the features supported by DLA and the features supported by the GPU, either implementation can be more performant. Which implementation to use depends on your latency or throughput requirements and your power budget. Since all DLA engines are independent of the GPU and each other, you could also use both implementations at the same time to further increase the throughput of your network.

For the performance issue, would you mind sharing the log generated with --dumpProfile?

$ /usr/src/tensorrt/bin/trtexec --dumpProfile ...

Thanks.

insup.choi · May 16, 2022, 5:10am

dumpProfile.log (16.3 KB)
I attached the dumpProfile log.

Is DLA mainly intended for energy efficiency, at the cost of latency?
I remember reading a document that said running on DLA is faster than on the GPU (unfortunately I can’t find it again).

Thanks.

AastaLLL · May 17, 2022, 6:09am

Hi,

DLA inference might be slower if some fallback is enabled.
But in general, it should give you a similar performance as GPU.

Would you mind attaching the profiling data of GPU mode as well?
This will help us to compare the performance at the layer level.

Also, in case you aren’t aware of this,
you should boost the device clocks with the following commands before benchmarking:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

insup.choi · May 17, 2022, 6:37am

Hello,

I attached the following logs.
Could you please look into them?
GPU_fp16.log : GPU-only mode
DLA_fp16.log : DLA-only mode

Thanks.

GPU_fp16.log (28.8 KB)
DLA_fp16.log (14.2 KB)

AastaLLL · May 19, 2022, 6:49am

Hi,

Thanks for sharing.

The performance from DLA is much slower at the layer level as well.
Is it possible to share the model with us so we can check it further?

Thanks.

insup.choi · May 19, 2022, 7:37am

Hello,

I attached the two models.

Thanks
resnet50_sim_mod_DLA_fp16.trt (57.2 MB)
resnet50_sim_mod_GPU_fp16.trt (47.2 MB)

AastaLLL · May 20, 2022, 1:05am

Hi,

Sorry for the unclear request.

Would you mind sharing the original ONNX model with us?
Since the TensorRT engine is not portable, this will help us test on different platforms and software versions.

Thanks.

insup.choi · May 20, 2022, 1:13am

Hello,

I attached onnx model file.
resnet50_sim_mod.onnx (89.6 MB)

Thanks.

AastaLLL · May 20, 2022, 8:59am

Hi,

Thanks for sharing the model.

Confirmed that we can reproduce the same performance issue internally.
We are checking this with our internal team. Will share more information with you later.

insup.choi · May 24, 2022, 1:43am

Hello,

I also took some measurements with the following command:
$ time runTest.sh & runTest.sh & runTest.sh & runTest.sh &
This runs four processes simultaneously.

The results suggest the following:

The execution time of each individual DLA unit is much longer than the GPU’s.
The total execution time across the DLA units is shorter than the GPU’s.

Could you explain why?
Shouldn’t the total execution time on the DLA units also be longer than the GPU’s?

Thanks.

AastaLLL · May 24, 2022, 5:26am

Hi,

We got some feedback from our internal team.

For Xavier, the GPU is 20 TOPS while DLA is 5 TOPS.
So purely from the TOPS point of view, the GPU is 4x faster than DLA.
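
To put the two numbers side by side, here is a minimal sketch that computes the peak ratio from the TOPS figures above and the latency ratio from the trtexec results reported at the top of this thread. The variable names are illustrative only, and peak TOPS is a rough proxy, not a latency prediction.

```shell
#!/bin/sh
# Peak compute figures quoted in this thread for Xavier.
gpu_tops=20
dla_tops=5
# fp16 latencies reported by the original poster, in milliseconds.
dla_ms=6.25089
gpu_ms=2.88255

# awk does the floating-point division that plain sh cannot.
peak_ratio=$(awk -v g="$gpu_tops" -v d="$dla_tops" 'BEGIN { printf "%.2f", g / d }')
measured_ratio=$(awk -v d="$dla_ms" -v g="$gpu_ms" 'BEGIN { printf "%.2f", d / g }')

echo "Peak GPU/DLA ratio:       ${peak_ratio}x"
echo "Measured DLA/GPU latency: ${measured_ratio}x"
```

With these inputs the measured gap (about 2.17x) is smaller than the 4x peak ratio, so the reported DLA latency is not worse than the peak figures alone would suggest.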

Thanks.

insup.choi · May 30, 2022, 12:28am

Hello,

Could you please reply to my question above about multi-process performance?

Thanks.

AastaLLL · May 30, 2022, 5:54am

Hi,

There are two DLA cores but only one GPU on Xavier.
So running jobs in parallel benefits DLA, since two jobs can run on different DLAs without sharing resources.
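
As a rough upper bound, the aggregate throughput can be estimated from the single-stream latencies reported earlier in this thread. This sketch assumes each DLA sustains that latency when both run together and that the two cores do not contend for memory bandwidth; both are assumptions, not measurements.

```shell
#!/bin/sh
# Single-stream fp16 latencies from the first post, in milliseconds.
dla_ms=6.25089
gpu_ms=2.88255
dla_cores=2   # AGX Xavier has two DLA cores, as stated above

# Inferences per second = 1000 ms / latency, scaled by the number of engines.
one_dla=$(awk -v l="$dla_ms" 'BEGIN { printf "%.1f", 1000 / l }')
all_dla=$(awk -v l="$dla_ms" -v n="$dla_cores" 'BEGIN { printf "%.1f", n * 1000 / l }')
one_gpu=$(awk -v l="$gpu_ms" 'BEGIN { printf "%.1f", 1000 / l }')

echo "One DLA:  ${one_dla} inferences/s"
echo "Two DLAs: ${all_dla} inferences/s (idealized)"
echo "One GPU:  ${one_gpu} inferences/s"
```

Real multi-process runs can differ from this estimate, because processes that share the single GPU may also pay for context switching and startup overhead.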

Thanks.

insup.choi · May 30, 2022, 6:34am

Hello,

Do you mean that the two DLAs work simultaneously even if I explicitly run the command with the “--useDLACore=0” option?

Thanks.

AastaLLL · June 6, 2022, 5:36am

Hi,

If you have specified the DLA index, TensorRT should run the tasks on DLA-0.
Could you help to confirm by checking the DLA status?

$ watch -n 1 "cat /sys/devices/platform/host1x/15880000.nvdla0/power/runtime_status"
$ watch -n 1 "cat /sys/devices/platform/host1x/158c0000.nvdla1/power/runtime_status"

Thanks.

insup.choi · June 7, 2022, 3:32am

Hello,

I confirmed that the DLAs work, but only one at a time:
--useDLACore=0 : DLA-0 works
--useDLACore=1 : DLA-1 works

Could you tell me how to activate both DLAs at the same time in order to improve latency?

Thanks

AastaLLL · June 8, 2022, 6:21am

Hi,

Have you tried launching two TensorRT samples at the same time,
so that each one is deployed on a different DLA?

Note that this only increases throughput; it does not reduce latency.
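
A minimal sketch of that setup, assuming the serialized engine attached earlier in this thread can be loaded on either core. The RUN switch and the log file names are illustrative choices, not part of trtexec.

```shell
#!/bin/sh
# Sketch: one trtexec process per DLA core, started together.
# Assumption: the engine file name below is the one attached in this thread.
TRTEXEC=${TRTEXEC:-/usr/src/tensorrt/bin/trtexec}
ENGINE=${ENGINE:-resnet50_sim_mod_DLA_fp16.trt}

# Print the command line for one DLA core (0 or 1 on AGX Xavier).
dla_cmd() {
    echo "$TRTEXEC --loadEngine=$ENGINE --useDLACore=$1"
}

for core in 0 1; do
    if [ "${RUN:-0}" = "1" ]; then
        # Background each process so both cores are busy at once.
        $(dla_cmd "$core") > "dla${core}.log" 2>&1 &
    else
        dla_cmd "$core"   # dry run: only show what would be launched
    fi
done
wait   # blocks until both runs finish; returns at once in a dry run
```

Set RUN=1 to actually launch the processes; without it the script only prints the two command lines. Each process still reports its own per-inference latency, so any gain shows up only in the combined throughput.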

Thanks.
