How to get reliable GPU clock rate on Jetson AGX Xavier?

Hello,

I’m using Jetson AGX Xavier with TensorRT to run object detection with RetinaNet for my master’s thesis. Sadly, I’m facing a couple of issues. My pipeline is mmdetection -> ONNX -> TensorRT 6 and all tests are run in FP16 mode with batch size 1.

I have trouble getting over 50 fps, regardless of how I adjust the image size, detection head, or backbone. After some research I found that the GPU is clocked down during image loading and the clock does not ramp back up fast enough for the inference. The two solutions I have tested so far are either running inference on the same image multiple times and timing only the last pass, or setting a minimum GPU frequency via /sys/devices/17000000.gv11b/devfreq/17000000.gv11b/min_freq. The first solution makes the evaluation really slow. With the second solution I am not sure whether the 30 W limit is still enforced, or whether the power consumption can exceed 30 W in this case. Is there any other way to ramp up the GPU clock faster when inference starts?
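For reference, this is roughly how I pin the frequency floor (a sketch only — the devfreq path is the one from my system above; `max_freq`, `cur_freq`, and `available_frequencies` are standard Linux devfreq attributes, but node names may differ between L4T releases):

```shell
# Raise the devfreq floor so the GPU stays at its maximum clock.
# Path taken from my JetPack 4.3 install; verify it on your board.
GPU_DEVFREQ=/sys/devices/17000000.gv11b/devfreq/17000000.gv11b

cat "$GPU_DEVFREQ/available_frequencies"          # list supported clocks
MAX=$(cat "$GPU_DEVFREQ/max_freq")                # highest supported clock (Hz)
echo "$MAX" | sudo tee "$GPU_DEVFREQ/min_freq"    # lock the floor to the maximum
cat "$GPU_DEVFREQ/cur_freq"                       # verify the clock is now pinned
```

My concern is exactly whether pinning min_freq this way bypasses the 30 W budget.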

Is there a way to reliably measure the power consumption? I have seen the /sys/bus/i2c/drivers/ina3221x/1-0040/iio:device0/in_power*_input files in the Jetson documentation, which should report the power consumption of the GPU, CPU, and SOC in mW (https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide%2Fpower_management_jetson_xavier.html%23wwpID0E0YF0HA). However, during inference in 30 W mode with default clock settings, the three files show a total of about 5000. 5 W seems far too little — so is the unit perhaps 10 mW instead of 1 mW? But then 50 W would be too much for the 30 W mode.
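This is how I read the rails (a sketch based on the sysfs path from the docs above; the channel-to-rail mapping may vary between L4T releases, so this just sums all three channels):

```shell
# Sum the three INA3221 power channels (GPU, CPU, SOC) during inference.
# Directory is the one from the L4T power management docs for Xavier.
RAIL_DIR=/sys/bus/i2c/drivers/ina3221x/1-0040/iio:device0

total_mw=0
for f in "$RAIL_DIR"/in_power*_input; do
    [ -e "$f" ] || continue                       # skip if the node is absent
    total_mw=$(( total_mw + $(cat "$f") ))
done

# Docs say the values are in mW, so divide by 1000 for W.
echo "total: ${total_mw} mW = $(awk -v mw="$total_mw" 'BEGIN { printf "%.1f", mw/1000 }') W"
```

With this I consistently get the ~5 W total mentioned above, which is why I suspect either the unit or my reading method.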

Is there more detailed documentation on what is meant by the "optimized power budget" mentioned in the documentation (https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide%2Fpower_management_jetson_xavier.html%23wwpID0E0LO0HA)? In particular, I would like to know how the budget is shared between the CPU and the GPU.

Thank you in advance for any support.

Environment

TensorRT Version: 6.0.1
GPU Type: Volta (Jetson AGX Xavier)
Nvidia Driver Version: JetPack 4.3
CUDA Version: 10.0
CUDNN Version: 7.6.3
Operating System + Version: JetPack 4.3
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.3.0
Baremetal or Container (if container which image + tag): Baremetal

Hi,

1.
A camera-to-inference pipeline is a complicated problem.
Avoiding unnecessary memory copies is the key to performance.

A suggested pipeline is to read the camera input into a DMA buffer and make it GPU-accessible via the EGL API.
You can find samples in our DeepStream SDK: https://developer.nvidia.com/deepstream-sdk

When you profile the same image multiple times, do you include the memory copy time?
If so, you can get better results with the DeepStream SDK.

To set up the power mode, you can use our nvpmodel tool:
https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide%2Fpower_management_jetson_xavier.html%23wwpID0E0LO0HA

For your use case (maximum frequency within the 30 W budget), run:

$ sudo nvpmodel -m 3
$ sudo jetson_clocks
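To double-check that the settings took effect, you can query the active mode and clock configuration (flags as found in the JetPack 4.x tools):

```shell
sudo nvpmodel -q            # print the currently active power mode
sudo jetson_clocks --show   # print current CPU/GPU/EMC clock settings
```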

2.
You can check this page for more information:

Thanks.