DLA performance lower than expected (around half)

Greetings everyone,

I have a Jetson AGX Orin 64GB and I’m testing it with the models available in NVIDIA’s Deep Learning Accelerator repo, specifically those in the “Orin Dense Performance” section of the page. Find it here: GitHub - NVIDIA/Deep-Learning-Accelerator-SW: NVIDIA DLA-SW, the recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications.

I cloned the repo and successfully followed the steps to download the .onnx models and execute them using the command lines provided in the README.md at Deep-Learning-Accelerator-SW/scripts/prepare_models/README.md. However, when I try to reproduce the results reported in the “DLA dense performance” section, my performance is well below theirs.

Verbose-mode logs are attached. As a new user I can only put four links in my post, so I’ll attach these two; the other four models showed the same behavior.

log_resnet50_MAXN.txt (246.6 KB)
log_ssd_mobilenetv1_MAXN.txt (242.2 KB)

  • ResNet-50: theirs is 2037 fps, mine is 504 qps * 2 (batch) = 1008 fps.
  • SSD-MobileNetV1: theirs is 2664 fps, mine is 655 qps * 2 (batch) = 1310 fps.

For the other models whose logs were omitted in this comment:

  • RetinaNet ResNeXt-50: theirs is 78 fps, mine is 39 fps
  • RetinaNet ResNet-34: theirs is 108 fps, mine is 53 fps
  • SSD-ResNet-34: theirs is 83 fps, mine is 41 fps

My results seem to be consistently around half of their reported results. I copied and pasted their command lines for execution, so I don’t think I’m missing an option here. I double-checked that I was in MAXN power mode. I do not understand what I’m missing.

Thanks in advance :)

Hi,
Here are some suggestions for common issues:

1. Performance

Please run the commands below before benchmarking a deep learning use case:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

2. Installation

Installation guide of deep learning frameworks on Jetson:

3. Tutorial

Startup deep learning tutorial:

4. Report issue

If these suggestions don’t help and you want to report an issue to us, please share the model, commands/steps, and the customized app (if any) so we can reproduce the issue locally.

Thanks!

Sure, I repeated the process after executing these lines; the performance remained the same.

sudo nvpmodel -m 0
sudo jetson_clocks

We can start troubleshooting with resnet50, from the Deep-Learning-Accelerator-SW repo:
resnet50_v1_prepared.zip (90.6 MB)

The command is this one, copied and pasted from the Deep-Learning-Accelerator-SW repo:

trtexec --useDLACore=0 --int8 --memPoolSize=dlaSRAM:1 --inputIOFormats=int8:dla_hwc4 --outputIOFormats=int8:chw32 --onnx=resnet50_v1_prepared.onnx --shapes=input_tensor:0:2x3x224x224 --verbose

The log I get is as follows:
log_resnet50_MAXN.txt (246.6 KB)

Their reported performance is 2037 fps, while mine is about 505 qps for batch 2, i.e. 1010 fps.
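For reference, the qps figure above comes from the “Throughput: … qps” line in trtexec’s performance summary, where one query is one batch. A minimal sketch of that conversion (the helper and the sample log line are illustrative, not part of the repo’s tooling):

```python
import re


def throughput_fps(log_text: str, batch_size: int) -> float:
    """Extract trtexec's 'Throughput: X qps' figure and convert to fps.

    trtexec counts one query per batch, so fps = qps * batch_size.
    """
    m = re.search(r"Throughput:\s*([\d.]+)\s*qps", log_text)
    if m is None:
        raise ValueError("no Throughput line found in log")
    return float(m.group(1)) * batch_size


# Invented sample line resembling the trtexec summary output:
sample = "[I] Throughput: 505.0 qps"
print(throughput_fps(sample, batch_size=2))  # 1010.0
```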

Hi,

There are two DLA engines on the Orin hardware.
The profiling data in our GitHub is measured with both DLAs.

Do you run the benchmark on DLA0 and DLA1 concurrently and sum the throughput?

Thanks.

Hi! Thanks for your answer,

So, for my previous result I was copying and pasting the command provided in the repo:

trtexec --useDLACore=0 --int8 --memPoolSize=dlaSRAM:1 --inputIOFormats=int8:dla_hwc4 --outputIOFormats=int8:chw32 --onnx=resnet50_v1_prepared.onnx --shapes=input_tensor:0:2x3x224x224 --verbose

Running this in one terminal yielded approximately half of the performance reported by NVIDIA.


After reading your comment, I ran two instances concurrently in two different terminals, adding the option --duration=10 to ensure that they execute concurrently.

Terminal #1:

trtexec --useDLACore=0 --int8 --memPoolSize=dlaSRAM:1 --inputIOFormats=int8:dla_hwc4 --outputIOFormats=int8:chw32 --onnx=resnet50_v1_prepared.onnx --shapes=input_tensor:0:2x3x224x224 --verbose --duration=10 > infer_dla0_concurrent_v2.log

Terminal #2:

trtexec --useDLACore=1 --int8 --memPoolSize=dlaSRAM:1 --inputIOFormats=int8:dla_hwc4 --outputIOFormats=int8:chw32 --onnx=resnet50_v1_prepared.onnx --shapes=input_tensor:0:2x3x224x224 --verbose --duration=10 > infer_dla1_concurrent_v2.log
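The only difference between the two runs is the --useDLACore index. A sketch (hypothetical helper; flags copied from the commands above) that assembles the per-core argument list, which could then be launched concurrently with subprocess.Popen:

```python
def build_trtexec_cmd(dla_core: int,
                      onnx: str = "resnet50_v1_prepared.onnx",
                      duration_s: int = 10) -> list[str]:
    """Assemble the trtexec argument list used in the two terminals above."""
    return [
        "trtexec",
        f"--useDLACore={dla_core}",   # 0 or 1: which DLA engine to target
        "--int8",
        "--memPoolSize=dlaSRAM:1",
        "--inputIOFormats=int8:dla_hwc4",
        "--outputIOFormats=int8:chw32",
        f"--onnx={onnx}",
        "--shapes=input_tensor:0:2x3x224x224",
        "--verbose",
        f"--duration={duration_s}",   # long enough that both runs overlap
    ]


print(build_trtexec_cmd(0)[1])  # --useDLACore=0
```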

Logs are here:
infer_dla1_concurrent_v2.log (275.6 KB)
infer_dla0_concurrent_v2.log (275.7 KB)

The performance is around 460 qps × 2 (batch) = 920 fps per core, × 2 cores = 1840 fps. This is certainly better when summing the performance of both cores. That said, I still don’t reach the ~510 qps per core needed to hit the 2037 fps reported by the repo; I am off by ~50 qps per concurrent execution.

On your repo, it is written as follows:

2x DLA images per second on a Jetson AGX Orin 64GB…

I wonder: were the reported results run on only a single core and then multiplied by two to simulate execution on both cores? If that is the case, my result of 505 qps × 2 (batch) = 1010 fps would fit quite well; I would only need to double my throughput to reach the ~2020 fps reported in the repo README.
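The hypothesis can be checked with quick arithmetic (illustrative only; the 505 qps figure is my single-core measurement from above):

```python
def projected_fps(qps_single_core: float, batch_size: int) -> float:
    """Project total fps under the 'single core measured, then doubled'
    hypothesis: per-core qps times batch size, times two cores."""
    return qps_single_core * batch_size * 2


# Single-core figure from this thread: ~505 qps at batch 2.
print(projected_fps(505.0, 2))  # 2020.0 -- close to the reported 2037 fps
```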

Thanks again! :)

Hi,

Do you mean your result is 1840 fps but our table shows 2037 fps?

If so, the difference (<10%) might come from different software versions.
Do you also use JetPack 5.1.1 and boost the device’s performance with the following command?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.