Model inference energy consumption of DLA on AGX Orin: benchmark problem

Recently, I have been interested in the energy consumption of DLA on AGX Orin, and I found the following benchmark on GitHub.

I tried to reproduce it, but the results I obtained do not match the results you presented. Here I take ResNet-50 and SSD-ResNet-34 as examples, and I hope you can provide the test scripts or explain how the energy consumption metrics are calculated.

For ResNet-50:

I use the following command for DLA:

/usr/src/tensorrt/bin/trtexec --useDLACore=0 --iterations=10000 --int8 --memPoolSize=dlaSRAM:1 --inputIOFormats=int8:dla_hwc4 --outputIOFormats=int8:chw32 --onnx=./models/resnet50_v1_prepared.onnx --saveEngine=./models/resnet50_v1_prepared_dla.trt --shapes=input_tensor:0:2x3x224x224

and the following command for GPU only:

 /usr/src/tensorrt/bin/trtexec --allowGPUFallback --iterations=10000 --int8 --onnx=./models/resnet50_v1_prepared.onnx --saveEngine=./models/resnet50_v1_prepared_gpu.trt  --shapes=input_tensor:0:2x3x224x224
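If it helps, the throughput number can also be pulled out of a saved trtexec log programmatically. A minimal sketch, assuming the performance summary contains a line of the form "Throughput: <N> qps" (the exact wording may vary across TensorRT versions) and a hypothetical log file name:

import re

def trtexec_qps(logfile: str) -> float:
    # Look for the trtexec performance summary line, e.g. "[I] Throughput: 1234.5 qps".
    with open(logfile) as f:
        for line in f:
            m = re.search(r"Throughput:\s*([\d.]+)\s*qps", line)
            if m:
                return float(m.group(1))
    raise ValueError("no throughput line found in " + logfile)

print(trtexec_qps("resnet50_dla.log"))  # hypothetical log file name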

Then, with tegrastats, I got the following power consumption result at MAXN power mode:
[tegrastats screenshot]

This result seems to indicate that, when considering the overall power consumption (GPU_SOC + CPU_CV + SYS_5V0), the pure-DLA approach actually consumes more power. (Does this mean DLA causes additional power consumption elsewhere?) However, the results you provided are shown in the following figure, and I am wondering how that result was calculated.
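A minimal sketch of one way to average these rails from a tegrastats log (assuming it was captured with something like "tegrastats --interval 100 --logfile power.log", and that each sample contains fields of the form "VDD_GPU_SOC 4938mW/4712mW", i.e. instantaneous/average; rail names and formatting can differ between Jetson modules and JetPack releases):

import re

# Rails summed in the discussion above; adjust to whatever your board reports.
RAILS = ("VDD_GPU_SOC", "VDD_CPU_CV", "VIN_SYS_5V0")

def average_power_mw(logfile: str) -> dict:
    sums = {r: 0.0 for r in RAILS}
    counts = {r: 0 for r in RAILS}
    with open(logfile) as f:
        for line in f:
            for rail in RAILS:
                # First number is the instantaneous reading, second is tegrastats' running average.
                m = re.search(rf"{rail} (\d+)mW/(\d+)mW", line)
                if m:
                    sums[rail] += float(m.group(1))
                    counts[rail] += 1
    return {r: sums[r] / counts[r] for r in RAILS if counts[r]}

avg = average_power_mw("power.log")  # hypothetical log file
print(avg, "total:", sum(avg.values()), "mW")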

For SSD-ResNet-34:

DLA command:

/usr/src/tensorrt/bin/trtexec --useDLACore=0 --loadEngine=./models/${MODEL_NAME}_dla.trt --batch=${BATCH_SIZE} --iterations=1000 --streams=1

GPU only command:

/usr/src/tensorrt/bin/trtexec --loadEngine=./models/${MODEL_NAME}_gpu.trt --batch=${BATCH_SIZE} --iterations=1000 --streams=1 

Result at MAXN power mode:
[tegrastats screenshot]

Your Result:

Thank you. Looking forward to your reply.

Hi,

The ratio is calculated in a much more straightforward way.

Under the given power mode (e.g. MAXN), the ratio = qps of DLA / qps of GPU.

Thanks.

However, my tests show that, in terms of QPS, DLA is not as high as GPU only, as shown in the following figures (resnet34-ssd1200):

DLA:

GPU only:

Is there something wrong with my command?

Moreover, so far, I have NEVER found a case where the QPS of DLA is greater than that of GPU only. Can you provide such an example? Thank you.

Hi,

Just want to confirm: do you use the same commands as below?

https://github.com/NVIDIA/Deep-Learning-Accelerator-SW/blob/main/scripts/prepare_models/README.md/#prepare--run-2

Thanks.

Hi,

Sorry, my previous comment was incorrect.

The ratio is not calculated as qps of DLA / qps of GPU.
It should be (DLA qps/W) / (GPU qps/W), i.e. (DLA q/J) / (GPU q/J).

We are checking with our internal team for more details and will update you later.

Thanks.
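A minimal sketch of that arithmetic, with placeholder numbers only (not measured data): since 1 W = 1 J/s, qps divided by average watts gives queries per joule, and the score is the ratio of the two.

def queries_per_joule(qps: float, avg_power_w: float) -> float:
    # qps / W = (queries/s) / (J/s) = queries per joule
    return qps / avg_power_w

# Placeholder numbers for illustration only.
dla_qpj = queries_per_joule(qps=1500.0, avg_power_w=12.0)
gpu_qpj = queries_per_joule(qps=4000.0, avg_power_w=40.0)
print("ratio (DLA q/J over GPU q/J):", round(dla_qpj / gpu_qpj, 2))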

OK, thank you. I would also appreciate it if you could provide the commands for both GPU-only and DLA. Currently, only the DLA commands are available on GitHub, so I have written the GPU-only commands myself.

Hi,

The command to run on GPU can be found below.
Just remove the --useDLACore=0 and TensorRT will use the default GPU for inference.

/usr/src/tensorrt/bin/trtexec --iterations=10000 --int8 --memPoolSize=dlaSRAM:1 --inputIOFormats=int8:dla_hwc4 --outputIOFormats=int8:chw32 --onnx=./models/resnet50_v1_prepared.onnx --saveEngine=./models/resnet50_v1_prepared_dla.trt --shapes=input_tensor:0:2x3x224x224

We have had some discussion internally; more precisely, the score is measured in q/J (queries per joule).
Unfortunately, we cannot share the power consumption numbers on the forum.

Thanks.

OK, thank you, I see. However, I am still curious:

  1. Why is there non-negligible power consumption on the GPU when running a TensorRT engine that is built for pure DLA?
  2. Can I ignore this GPU power consumption and take only the CPU and DLA power as the power consumption of the DLA-only method?

Hi,

1. Do you use jetson_clocks to lock the GPU clock to the maximum?
Moreover, some system-level applications might still need the GPU (e.g. rendering).

2. Unfortunately, no. The measured power will still include some non-negligible GPU power consumption.

Thanks.

No, I didn’t use jetson_clocks because I noticed that your experiments were conducted in MAXN mode. I used the command

nvpmodel -m 0 

to set the power mode.

Additionally, I am quite sure that I turned off the desktop and all other display interfaces.

Hi,

You can try to set the GPU to a low frequency to reduce the power consumption.

But please note that the measured power is still a systemwide figure.

Thanks.
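One common, unofficial workaround for the systemwide measurement is to subtract an idle baseline captured with no workload running, under the assumption that the background consumption stays roughly constant. A minimal sketch with placeholder numbers:

def incremental_energy_per_query_j(loaded_power_w: float, idle_power_w: float, qps: float) -> float:
    # Energy attributable to the workload per query, assuming constant background power.
    return (loaded_power_w - idle_power_w) / qps

# Placeholder numbers for illustration only.
print(incremental_energy_per_query_j(loaded_power_w=18.0, idle_power_w=9.0, qps=1500.0))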

Thank you for your response. I conducted another test in which I locked both the GPU and CPU frequencies to their lowest while using DLA; during the GPU-only test, I locked them at their highest. The results differed from before, but they are still far from what I was expecting.

Additionally, from the TensorRT test results, it is evident that the QPS of the model on DLA is significantly lower than in the GPU-only scenario.

DLA scenario :

GPU-only scenario :

Hi,

DLA is lower in qps but higher in q/J.

Sorry, we cannot share any power data about DLA, so the information we can provide is really limited.

Thanks.

OK, I see. Thank you.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.