ResNet50 with DLA has 2x higher latency than with just the GPU

Please provide the following info (check/uncheck the boxes after creating this topic):
Software Version
DRIVE OS Linux 5.2.6
DRIVE OS Linux 5.2.6 and DriveWorks 4.0
DRIVE OS Linux 5.2.0
DRIVE OS Linux 5.2.0 and DriveWorks 3.5
NVIDIA DRIVE™ Software 10.0 (Linux)
NVIDIA DRIVE™ Software 9.0 (Linux)
other DRIVE OS version
other

Target Operating System
Linux
QNX
other

Hardware Platform
NVIDIA DRIVE™ AGX Xavier DevKit (E3550)
NVIDIA DRIVE™ AGX Pegasus DevKit (E3550)
other

SDK Manager Version
1.7.0.8846
other

Host Machine Version
native Ubuntu 18.04
other

We ran a simple resnet50.onnx on AGX with --fp16 --useDLACore=0 --allowGPUFallback; the latency and throughput are significantly worse. I see all layers running on DLA, but the profile seems to indicate that “data_copy_finish” is taking most of the latency.

Any idea what is going on? Also, why does the profile seem to show all layers being run 3 times? Attached is the profile.json for the run with DLA.
resnet_fp16_dla.json (10.5 KB)

Dear @user3705,
May I know how you profiled your application? Could you share the complete command?

Hi

I used --fp16 --useDLACore=0 --allowGPUFallback --exportProfile=profile.json
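For reference, a minimal sketch of the full trtexec invocation those flags imply (the model file name is taken from the first post; the exact path on the target is a placeholder):

trtexec --onnx=resnet50.onnx --fp16 --useDLACore=0 --allowGPUFallback --exportProfile=profile.json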


Hi, @user3705
Did you mean “--useDLACore=1”?

Hi,

There are two DLA cores… we can use either core 0 or core 1

Sorry. I misunderstood the option. Please ignore my question.

Please see if the below information in FAQ can help with your questions on this topic.

Q: Why does my network run slower when using DLA compared to without DLA?
A: DLA was designed to maximize energy efficiency. Depending on the features supported by DLA and the features supported by the GPU, either implementation can be more performant. Which implementation to use depends on your latency or throughput requirements and your power budget. Since all DLA engines are independent of the GPU and each other, you could also use both implementations at the same time to further increase the throughput of your network.
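As a hedged illustration of the FAQ’s suggestion to use both implementations at once: one simple way to exercise the GPU and a DLA core concurrently is to launch two trtexec runs in parallel (model path is a placeholder; it is the aggregate throughput, not per-run latency, that this improves):

trtexec --onnx=resnet50.onnx --fp16 &
trtexec --onnx=resnet50.onnx --fp16 --useDLACore=0 --allowGPUFallback &
wait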

Dear @user3705,
If you are referring to the “name” parameter value in the JSON, I can see that the name value strings differ at the end in the attached JSON file (see the sketch after this reply for one way to group the profile entries by name).

The latency and throughput are significantly worse

If you are comparing the performance with respect to the iGPU, this can be expected, as the iGPU is generally more powerful than the DLA engine. The idea of having DLA is to offload a network onto it so that two networks can run in parallel (one on the GPU and one on DLA).
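A quick way to check how often each layer name appears in the exported profile is to group the entries by their “name” field, for example with jq. This assumes the exported profile is a flat JSON array of per-layer records, which may vary across TensorRT versions:

jq '[.[] | select(.name != null)] | group_by(.name) | map({name: .[0].name, appearances: length})' profile.json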

Hi,

But it is not the compute that is taking most of the latency; it is some data-transfer step called “data_copy_finish”. Why is that so much higher with DLA? It is taking over 90% of the total latency.

Dear @user3705,
Could you share your model so we can get more insight? I notice you are using a resnet50 model that is not shipped with the TensorRT samples.

Hi,

I am not able to upload the model… it seems to be too large and fails to upload. Let me know if there is another way to share it.

Thanks

Dear @user3705,
You can upload it to a public share drive and provide the URL details here.

Hi

You can try this.

https://github.com/onnx/models/blob/master/vision/classification/resnet/model/resnet50-v1-7.onnx

It should be similar.

Thanks

Dear @user3705,
In the above shared model, I notice a few layers are running on the GPU, which requires data transfers between the DLA and the GPU to share intermediate outputs. This increases the overall time.
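One way to confirm which layers fall back is to rebuild with trtexec’s verbose logging, which typically reports during engine construction which layers are placed on DLA and which are not supported and fall back to the GPU (model path is a placeholder):

trtexec --onnx=resnet50.onnx --fp16 --useDLACore=0 --allowGPUFallback --verbose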

This network does not have any special layers, and all intermediate layer inputs are well within bounds. Any idea why layers are falling back to the GPU? Resnet50 is a simple enough network to run on a DLA.