Unstable performance across multiple Jetson AGX Xavier devices

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) Jetson AGX Xavier 64GB Industrial (Auvidea), Jetson AGX Xavier 64GB Industrial (Forecr), Jetson AGX Orin Developer board
• DeepStream Version 6.1.1, 6.2, 6.0.1
• JetPack Version (valid for Jetson only) 5.0.2, 5.1
• TensorRT Version 8.4.1, 8.5.1
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type( questions, new requirements, bugs) question/bug
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

I have a DeepStream Python application similar to the deepstream_python_apps rtsp-in-rtsp-out sample. The pipeline handles a variable number of UDP multicast input streams through source bins containing udpsrc, rtpjpegdepay, a JPEG decoder and nvvideoconvert elements. These source bins attach to the streammux, which is linked to the pgie (a YOLOv8 nano model using the DeepStream-Yolo implementation), and the pgie is linked to an NvDCF tracker. The tracker is linked either to a fakesink or to a setup similar to the rtsp-in-rtsp-out app that restreams the tiled OSD output via multicast and RTSP.

This application runs stably on an RTX 2080, easily managing 25 FPS per input stream (real time) for 4 input streams without any stutters, and I can achieve a stable 17 FPS for 4 input streams on a Jetson AGX Orin developer board with Jetpack 5.0.2. When I run the same application on a Jetson AGX Xavier, the FPS is all over the place: sometimes near real-time throughput on 3 of the 4 streams, sometimes almost no throughput on 3 of the 4 streams. I have tested this on 2 Xavier devices from 2 different manufacturers (Auvidea and Forecr) and with a variety of settings: DLA and VA settings on the pgie, INT8 precision for the pgie, turning off the restreamed OSD result, different batch sizes on the muxer and pgie, power modes, jetson_clocks.
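
For reference, a rough sketch of how one of these source bins could be built is shown below. The element names, caps and properties are my reading of the description above (for example, whether the CPU jpegdec or an NVJPEG-based decoder is used is not stated), not the actual application code:

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst  # Gst.init(None) must have been called by the app


def create_source_bin(index: int, address: str, port: int) -> Gst.Bin:
    """Hypothetical source bin: UDP multicast -> RTP/JPEG depay -> JPEG decode -> NVMM."""
    nbin = Gst.Bin.new(f"source-bin-{index:02d}")

    # udpsrc joins the multicast group and delivers raw RTP/JPEG packets
    udpsrc = Gst.ElementFactory.make("udpsrc", f"udpsrc-{index}")
    udpsrc.set_property("address", address)
    udpsrc.set_property("port", port)
    udpsrc.set_property("auto-multicast", True)
    udpsrc.set_property("caps", Gst.Caps.from_string(
        "application/x-rtp,media=video,encoding-name=JPEG,clock-rate=90000,payload=26"))

    depay = Gst.ElementFactory.make("rtpjpegdepay", f"depay-{index}")
    decoder = Gst.ElementFactory.make("jpegdec", f"decode-{index}")  # CPU JPEG decode (assumption)
    nvconv = Gst.ElementFactory.make("nvvideoconvert", f"nvconv-{index}")
    capsfilter = Gst.ElementFactory.make("capsfilter", f"caps-{index}")
    capsfilter.set_property("caps", Gst.Caps.from_string(
        "video/x-raw(memory:NVMM),format=NV12"))  # hand NVMM buffers to nvstreammux

    for elem in (udpsrc, depay, decoder, nvconv, capsfilter):
        nbin.add(elem)
    udpsrc.link(depay)
    depay.link(decoder)
    decoder.link(nvconv)
    nvconv.link(capsfilter)

    # Expose the final src pad as a ghost pad so the bin can be linked to a
    # requested sink_%u pad on nvstreammux, as in the rtsp-in-rtsp-out sample.
    nbin.add_pad(Gst.GhostPad.new("src", capsfilter.get_static_pad("src")))
    return nbin
```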

I am wondering if I’m doing anything wrong here, possibly missing a setting or feature that is key to unlocking optimal performance on these devices. Should it be possible to run an application like this, with 4x 960x1280 UDP multicast streams, a 640x640 pgie and a tracker, on an industrial Xavier device, or is it simply not powerful enough? The FPS is also unstable on the Xavier devices when running just 2 input streams.

Below I’ll post FPS probe data from all 3 Jetson devices, along with their tegrastats while running the app, for reference. The FPS data comes from a buffer probe on the pgie (a probe on the tracker yields similar results).
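
The original probe code is not reproduced here; it is essentially a pad probe that counts frames per stream from the batch metadata and prints the rate periodically, roughly along the lines of this sketch (reporting interval, rounding and names are assumptions, not the exact code):

```python
import time
import pyds
from gi.repository import Gst

frame_counts = {}          # "streamN" -> frames seen since the last report
last_report = time.time()
REPORT_INTERVAL = 5.0      # seconds between **PERF lines (assumed)


def pgie_src_pad_buffer_probe(pad, info, u_data):
    """Count one frame per source per batched buffer and print per-stream FPS."""
    global last_report
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK

    # Walk the batch metadata and count one frame per source in this batch
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        key = f"stream{frame_meta.pad_index}"
        frame_counts[key] = frame_counts.get(key, 0) + 1
        try:
            l_frame = l_frame.next
        except StopIteration:
            break

    now = time.time()
    if now - last_report >= REPORT_INTERVAL:
        fps = {k: round(n / (now - last_report), 2)
               for k, n in sorted(frame_counts.items())}
        print(f"**PERF:  {fps}")
        frame_counts.clear()
        last_report = now
    return Gst.PadProbeReturn.OK


# Attached to the pgie src pad, e.g.:
# pgie.get_static_pad("src").add_probe(Gst.PadProbeType.BUFFER, pgie_src_pad_buffer_probe, 0)
```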

AGX Xavier 64GB Industrial (Auvidea) - Jetpack 5.0.2 - power mode 30W ALL - batch size 8 - INT8 - 4 streams - RTSP restream off

**PERF:  {'stream0': 1.2, 'stream1': 19.98, 'stream2': 0.0, 'stream3': 0.0}
**PERF:  {'stream0': 7.59, 'stream1': 17.39, 'stream2': 4.0, 'stream3': 0.8}
**PERF:  {'stream0': 5.0, 'stream1': 18.19, 'stream2': 1.4, 'stream3': 0.0}
**PERF:  {'stream0': 12.39, 'stream1': 16.39, 'stream2': 6.2, 'stream3': 1.0}
**PERF:  {'stream0': 9.39, 'stream1': 17.38, 'stream2': 0.2, 'stream3': 0.2}
**PERF:  {'stream0': 7.2, 'stream1': 16.79, 'stream2': 4.2, 'stream3': 0.4}
**PERF:  {'stream0': 13.99, 'stream1': 15.39, 'stream2': 3.8, 'stream3': 1.6}
**PERF:  {'stream0': 14.39, 'stream1': 16.59, 'stream2': 9.59, 'stream3': 9.79}
**PERF:  {'stream0': 16.58, 'stream1': 16.58, 'stream2': 15.98, 'stream3': 16.58}
**PERF:  {'stream0': 16.39, 'stream1': 16.39, 'stream2': 15.59, 'stream3': 16.19}
**PERF:  {'stream0': 16.39, 'stream1': 16.39, 'stream2': 10.39, 'stream3': 15.99}
**PERF:  {'stream0': 16.39, 'stream1': 16.39, 'stream2': 12.59, 'stream3': 15.99}
09-28-2023 12:19:29 RAM 8626/63219MB (lfb 12408x4MB) SWAP 0/31610MB (cached 0MB) CPU [37%@1190,32%@1190,12%@1190,16%@1190,14%@1190,21%@1190,29%@1190,16%@1189] EMC_FREQ 38%@1600 GR3D_FREQ 0%@1377 NVENC 115 NVENC1 115 VIC_FREQ 67%@192 APE 150 AUX@46.5C CPU@49.5C thermal@48.6C Tboard@48C AO@46.5C GPU@50C Tdiode@49.25C PMIC@50C GPU 6740mW/6441mW CPU 1007mW/1016mW SOC 2718mW/2586mW CV 0mW/0mW VDDRQ 1607mW/1532mW SYS5V 3995mW/3951mW
09-28-2023 12:19:30 RAM 8626/63219MB (lfb 12408x4MB) SWAP 0/31610MB (cached 0MB) CPU [32%@1189,23%@1190,14%@1190,18%@1190,12%@1190,6%@1189,24%@1190,25%@1190] EMC_FREQ 38%@1600 GR3D_FREQ 55%@1377 NVENC 115 NVENC1 115 VIC_FREQ 68%@204 APE 150 AUX@46.5C CPU@49.5C thermal@48.45C Tboard@48C AO@47C GPU@50C Tdiode@49C PMIC@50C GPU 6745mW/6442mW CPU 1006mW/1016mW SOC 2718mW/2586mW CV 0mW/0mW VDDRQ 1607mW/1532mW SYS5V 3995mW/3951mW
09-28-2023 12:19:31 RAM 8626/63219MB (lfb 12408x4MB) SWAP 0/31610MB (cached 0MB) CPU [42%@1190,22%@1189,24%@1190,14%@1189,17%@1189,14%@1190,1%@1190,9%@1190] EMC_FREQ 37%@1600 GR3D_FREQ 29%@1377 NVENC 115 NVENC1 115 VIC_FREQ 68%@217 APE 150 AUX@46.5C CPU@49C thermal@48.75C Tboard@48C AO@47C GPU@50.5C Tdiode@49C PMIC@50C GPU 6941mW/6444mW CPU 906mW/1015mW SOC 2718mW/2587mW CV 0mW/0mW VDDRQ 1607mW/1533mW SYS5V 3995mW/3952mW
09-28-2023 12:19:32 RAM 8625/63219MB (lfb 12408x4MB) SWAP 0/31610MB (cached 0MB) CPU [35%@1190,27%@1190,23%@1185,22%@1190,14%@1190,14%@1190,10%@1190,12%@1190] EMC_FREQ 38%@1600 GR3D_FREQ 21%@1377 NVENC 115 NVENC1 115 VIC_FREQ 68%@230 APE 150 AUX@46.5C CPU@49.5C thermal@48.9C Tboard@48C AO@47C GPU@50C Tdiode@49C PMIC@50C GPU 6841mW/6446mW CPU 906mW/1015mW SOC 2718mW/2587mW CV 0mW/0mW VDDRQ 1607mW/1533mW SYS5V 4034mW/3952mW
09-28-2023 12:19:33 RAM 8620/63219MB (lfb 12408x4MB) SWAP 0/31610MB (cached 0MB) CPU [47%@1190,29%@1192,16%@1190,14%@1190,15%@1190,9%@1190,10%@1185,12%@1190] EMC_FREQ 38%@1600 GR3D_FREQ 0%@1377 NVENC 115 NVENC1 115 VIC_FREQ 71%@217 APE 150 AUX@46.5C CPU@49C thermal@48.45C Tboard@48C AO@47C GPU@50C Tdiode@49C PMIC@50C GPU 6640mW/6446mW CPU 1006mW/1015mW SOC 2718mW/2588mW CV 0mW/0mW VDDRQ 1607mW/1533mW SYS5V 3995mW/3952mW
09-28-2023 12:19:34 RAM 8620/63219MB (lfb 12408x4MB) SWAP 0/31610MB (cached 0MB) CPU [50%@1190,27%@1192,19%@1190,13%@1190,16%@1190,21%@1190,22%@1191,20%@1189] EMC_FREQ 38%@1600 GR3D_FREQ 19%@1377 NVENC 115 NVENC1 115 VIC_FREQ 74%@153 APE 150 AUX@46.5C CPU@49C thermal@48.45C Tboard@48C AO@47C GPU@49.5C Tdiode@49C PMIC@50C GPU 6539mW/6447mW CPU 1007mW/1015mW SOC 2718mW/2588mW CV 0mW/0mW VDDRQ 1507mW/1533mW SYS5V 3995mW/3952mW
09-28-2023 12:19:35 RAM 8620/63219MB (lfb 12408x4MB) SWAP 0/31610MB (cached 0MB) CPU [37%@1190,22%@1190,16%@1190,14%@1190,14%@1189,20%@1192,18%@1190,23%@1190] EMC_FREQ 38%@1600 GR3D_FREQ 48%@1377 NVENC 115 NVENC1 115 VIC_FREQ 68%@153 APE 150 AUX@46.5C CPU@49C thermal@48.6C Tboard@48C AO@47C GPU@50.5C Tdiode@49C PMIC@50C GPU 6841mW/6449mW CPU 1006mW/1015mW SOC 2718mW/2589mW CV 0mW/0mW VDDRQ 1607mW/1534mW SYS5V 3995mW/3953mW
EMC_FREQ 74%@1600 GR3D_FREQ 65%@1377 VIC_FREQ 34%@115 APE 150 AUX@49C CPU@53C thermal@52.6C Tboard@50C AO@50C GPU@56.5C Tdiode@52C PMIC@50C GPU 13739mW/7450mW CPU 1004mW/1042mW SOC 3110mW/2691mW CV 0mW/0mW VDDRQ 2403mW/1637mW SYS5V 4457mW/4019mW
09-28-2023 12:23:46 RAM 8563/63219MB (lfb 12408x4MB) SWAP 0/31610MB (cached 0MB) CPU [33%@1189,17%@1190,39%@1186,15%@1190,42%@1190,5%@1190,13%@1190,16%@1190] EMC_FREQ 75%@1600 GR3D_FREQ 59%@1377 VIC_FREQ 40%@115 APE 150 AUX@49C CPU@52.5C thermal@52.6C Tboard@50C AO@50C GPU@55.5C Tdiode@52.25C PMIC@50C GPU 13447mW/7462mW CPU 1004mW/1042mW SOC 3010mW/2692mW CV 0mW/0mW VDDRQ 2305mW/1639mW SYS5V 4418mW/4020mW
09-28-2023 12:23:47 RAM 8563/63219MB (lfb 12408x4MB) SWAP 0/31610MB (cached 0MB) CPU [20%@1189,14%@1192,25%@1190,3%@1187,46%@1190,2%@1189,9%@1189,11%@1189] EMC_FREQ 64%@1600 GR3D_FREQ 84%@1377 VIC_FREQ 34%@115 APE 150 AUX@49C CPU@52.5C thermal@51.85C Tboard@50C AO@49.5C GPU@55.5C Tdiode@51.75C PMIC@50C GPU 10449mW/7469mW CPU 905mW/1042mW SOC 2712mW/2692mW CV 0mW/0mW VDDRQ 1806mW/1639mW SYS5V 4151mW/4020mW
09-28-2023 12:23:48 RAM 8563/63219MB (lfb 12408x4MB) SWAP 0/31610MB (cached 0MB) CPU [34%@1190,27%@1189,17%@1190,20%@1190,34%@1189,6%@1190,21%@1190,21%@1189] EMC_FREQ 61%@1600 GR3D_FREQ 70%@1377 VIC_FREQ 36%@115 APE 150 AUX@49C CPU@52.5C thermal@52.3C Tboard@50C AO@50C GPU@55.5C Tdiode@52C PMIC@50C GPU 11748mW/7477mW CPU 1004mW/1042mW SOC 2813mW/2692mW CV 0mW/0mW VDDRQ 2106mW/1640mW SYS5V 4301mW/4021mW
09-28-2023 12:23:49 RAM 8564/63219MB (lfb 12408x4MB) SWAP 0/31610MB (cached 0MB) CPU [29%@1190,40%@1190,9%@1189,6%@1190,19%@1190,26%@1190,16%@1190,24%@1190] EMC_FREQ 58%@1600 GR3D_FREQ 92%@1377 VIC_FREQ 46%@115 APE 150 AUX@49C CPU@52.5C thermal@52C Tboard@50C AO@50C GPU@56C Tdiode@52C PMIC@50C GPU 11153mW/7485mW CPU 1005mW/1042mW SOC 2813mW/2692mW CV 0mW/0mW VDDRQ 2005mW/1641mW SYS5V 4223mW/4021mW

AGX Xavier 64GB Industrial (Forecr) - Jetpack 5.1 - power mode 30W ALL - batch size 8 - INT8 - 4 streams - RTSP restream off

**PERF:  {'stream0': 16.79, 'stream1': 16.79, 'stream2': 7.99, 'stream3': 10.79}
**PERF:  {'stream0': 17.19, 'stream1': 16.99, 'stream2': 8.59, 'stream3': 12.19}
**PERF:  {'stream0': 14.59, 'stream1': 16.79, 'stream2': 6.39, 'stream3': 9.19}
**PERF:  {'stream0': 16.79, 'stream1': 16.99, 'stream2': 8.79, 'stream3': 7.39}
**PERF:  {'stream0': 16.99, 'stream1': 17.19, 'stream2': 5.6, 'stream3': 10.59}
**PERF:  {'stream0': 12.39, 'stream1': 17.19, 'stream2': 0.8, 'stream3': 2.0}
**PERF:  {'stream0': 0.8, 'stream1': 19.78, 'stream2': 0.0, 'stream3': 0.6}
**PERF:  {'stream0': 12.99, 'stream1': 17.59, 'stream2': 0.4, 'stream3': 0.2}
**PERF:  {'stream0': 9.59, 'stream1': 16.79, 'stream2': 5.2, 'stream3': 2.6}
09-28-2023 10:53:25 RAM 14270/63217MB (lfb 6765x4MB) SWAP 0/31608MB (cached 0MB) CPU [38%@1190,36%@1189,23%@1190,21%@1191,24%@1190,22%@1190,1%@1190,4%@1189] EMC_FREQ 41%@1600 GR3D_FREQ 85%@905 NVENC 115 NVENC1 115 VIC_FREQ 70%@230 APE 150 AUX@47.5C CPU@49C thermal@48.05C Tboard@47C AO@48C GPU@48C Tdiode@49.5C PMIC@50C GPU 3721mW/3656mW CPU 1026mW/1090mW SOC 2823mW/2823mW CV 0mW/0mW VDDRQ 1795mW/1731mW SYS5V 3554mW/3554mW
09-28-2023 10:53:26 RAM 14270/63217MB (lfb 6765x4MB) SWAP 0/31608MB (cached 0MB) CPU [32%@1191,25%@1190,24%@1190,15%@1190,33%@1190,30%@1185,3%@1190,2%@1190] EMC_FREQ 44%@1600 GR3D_FREQ 1%@905 NVENC 115 NVENC1 115 VIC_FREQ 73%@204 APE 150 AUX@47.5C CPU@49.5C thermal@47.75C Tboard@47C AO@48C GPU@48C Tdiode@49.5C PMIC@50C GPU 3977mW/3763mW CPU 1154mW/1111mW SOC 2951mW/2865mW CV 0mW/0mW VDDRQ 1923mW/1795mW SYS5V 3634mW/3580mW
09-28-2023 10:53:27 RAM 14270/63217MB (lfb 6765x4MB) SWAP 0/31608MB (cached 0MB) CPU [38%@1190,30%@1190,25%@1191,30%@1190,29%@1189,25%@1190,6%@1190,13%@1190] EMC_FREQ 44%@1600 GR3D_FREQ 18%@905 NVENC 115 NVENC1 115 VIC_FREQ 68%@307 APE 150 AUX@47.5C CPU@49.5C thermal@47.95C Tboard@47C AO@48C GPU@48C Tdiode@49.5C PMIC@50C GPU 4106mW/3849mW CPU 1154mW/1122mW SOC 2951mW/2887mW CV 0mW/0mW VDDRQ 1923mW/1827mW SYS5V 3634mW/3594mW
09-28-2023 10:53:28 RAM 14270/63217MB (lfb 6765x4MB) SWAP 0/31608MB (cached 0MB) CPU [35%@1190,30%@1189,21%@1190,20%@1190,29%@1189,16%@1188,4%@1190,3%@1190] EMC_FREQ 43%@1600 GR3D_FREQ 32%@905 NVENC 115 NVENC1 115 VIC_FREQ 68%@192 APE 150 AUX@47.5C CPU@49C thermal@48.1C Tboard@47C AO@48C GPU@47.5C Tdiode@49.5C PMIC@50C GPU 3721mW/3823mW CPU 1026mW/1102mW SOC 2823mW/2874mW CV 0mW/0mW VDDRQ 1795mW/1820mW SYS5V 3594mW/3594mW
09-28-2023 10:53:29 RAM 14270/63217MB (lfb 6765x4MB) SWAP 0/31608MB (cached 0MB) CPU [30%@1188,24%@1192,18%@1190,18%@1190,25%@1190,22%@1190,1%@1190,0%@1189] EMC_FREQ 42%@1600 GR3D_FREQ 44%@905 NVENC 115 NVENC1 115 VIC_FREQ 70%@204 APE 150 AUX@47C CPU@49C thermal@47.95C Tboard@47C AO@48.5C GPU@47.5C Tdiode@49.5C PMIC@50C GPU 3592mW/3784mW CPU 1026mW/1090mW SOC 2823mW/2865mW CV 0mW/0mW VDDRQ 1667mW/1795mW SYS5V 3554mW/3587mW
09-28-2023 10:53:30 RAM 14269/63217MB (lfb 6765x4MB) SWAP 0/31608MB (cached 0MB) CPU [32%@1190,27%@1190,27%@1188,20%@1190,27%@1189,21%@1187,6%@1190,17%@1193] EMC_FREQ 41%@1600 GR3D_FREQ 90%@905 NVENC 115 NVENC1 115 VIC_FREQ 67%@243 APE 150 AUX@47.5C CPU@49C thermal@47.75C Tboard@47C AO@48C GPU@48C Tdiode@49.5C PMIC@50C GPU 3464mW/3739mW CPU 1154mW/1099mW SOC 2824mW/2859mW CV 0mW/0mW VDDRQ 1667mW/1776mW SYS5V 3554mW/3582mW
09-28-2023 10:53:31 RAM 14270/63217MB (lfb 6765x4MB) SWAP 0/31608MB (cached 0MB) CPU [33%@1190,21%@1190,25%@1190,22%@1190,20%@1190,23%@1189,0%@1190,0%@1190] EMC_FREQ 43%@1600 GR3D_FREQ 0%@905 NVENC 115 NVENC1 115 VIC_FREQ 75%@140 APE 150 AUX@47.5C CPU@49C thermal@47.75C Tboard@47C AO@48C GPU@47.5C Tdiode@49.5C PMIC@50C GPU 3721mW/3736mW CPU 1026mW/1090mW SOC 2823mW/2855mW CV 0mW/0mW VDDRQ 1795mW/1779mW SYS5V 3554mW/3579mW
09-28-2023 10:53:32 RAM 14270/63217MB (lfb 6765x4MB) SWAP 0/31608MB (cached 0MB) CPU [36%@1190,33%@1189,27%@1184,12%@1189,23%@1189,28%@1190,5%@1190,6%@1190] EMC_FREQ 42%@1600 GR3D_FREQ 56%@905 NVENC 115 NVENC1 115 VIC_FREQ 73%@217 APE 150 AUX@47C CPU@49C thermal@47.95C Tboard@47C AO@48C GPU@48C Tdiode@49.5C PMIC@50C GPU 3721mW/3735mW CPU 1154mW/1097mW SOC 2823mW/2851mW CV 0mW/0mW VDDRQ 1795mW/1780mW SYS5V 3594mW/3580mW

AGX Orin Developer board - Jetpack 5.0.2 - power mode 30W (8 cores) - batch size 8 - INT8 - 4 streams - RTSP restream off

**PERF:  {'stream0': 24.97, 'stream1': 24.97, 'stream2': 25.17, 'stream3': 24.97}
**PERF:  {'stream0': 24.98, 'stream1': 24.98, 'stream2': 24.78, 'stream3': 24.78}
**PERF:  {'stream0': 24.58, 'stream1': 24.98, 'stream2': 24.98, 'stream3': 24.78}
**PERF:  {'stream0': 23.98, 'stream1': 24.98, 'stream2': 24.98, 'stream3': 24.98}

AGX Orin Developer board - Jetpack 5.0.2 - power mode 30W (8 cores) - batch size 8 - INT8 - 4 streams - RTSP restream on

**PERF:  {'stream0': 19.78, 'stream1': 19.58, 'stream2': 19.58, 'stream3': 19.38}
**PERF:  {'stream0': 18.98, 'stream1': 18.98, 'stream2': 18.98, 'stream3': 18.78}
**PERF:  {'stream0': 18.98, 'stream1': 19.18, 'stream2': 18.98, 'stream3': 18.78}
**PERF:  {'stream0': 18.99, 'stream1': 18.79, 'stream2': 18.59, 'stream3': 18.39}
**PERF:  {'stream0': 19.18, 'stream1': 19.18, 'stream2': 19.18, 'stream3': 18.98}

To me it looks like the pgie batch size isn’t really the impacting factor: the spikiness in performance on the Xaviers occurs with batch sizes 4, 8 and 16 alike, on both Xavier devices. Turning the RTSP restream on/off only really affects the Orin, reducing the FPS per stream by ~5 frames. It does not noticeably change the throughput on either Xavier device; both keep showing spiky throughput that varies a lot.

Hi, I am on the same team as @jeroen.denboef.
We have made a detailed runtime profile to better analyze the performance issues, which might help answer the original question. I have also added some additional questions that came out of this analysis.

TL;DR: Below we add some detailed profiles to better debug the instability in performance described above. We are wondering why the Xavier is not able to process all incoming images.

Setup:
For the following profiles we set up a Xavier and an Orin with 2 UDP streams that represent the cameras.
The streammux is configured to wait at most 40 ms to kick off a new batch, which aligns with the cameras’ 25 fps rate (see the snippet below).
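
For clarity, the relevant nvstreammux properties for this test would look roughly like this (the 40 ms timeout and 2 sources come from the setup above; the resolution and live-source flag are assumptions based on the original post):

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

# nvstreammux settings for the 2-stream profiling run (sketch, legacy nvstreammux)
streammux = Gst.ElementFactory.make("nvstreammux", "stream-muxer")
streammux.set_property("batch-size", 2)                # 2 UDP sources in this test
streammux.set_property("batched-push-timeout", 40000)  # microseconds: 40 ms = one 25 fps frame interval
streammux.set_property("width", 960)                   # input resolution mentioned above (960x1280)
streammux.set_property("height", 1280)
streammux.set_property("live-source", 1)               # the multicast feeds are live sources
```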

Picture 1:
Here we see a zoomed-out screenshot of the runtime profile from a Xavier.
At the top we see the PGIE activity, followed by the 2 udp-sources for the cameras.

In the first udp-source we see the expected behavior: approximately every 40 ms there is some activity, indicating that frames are coming in.

In the second udp-source, however, there are gaps in this activity.
(We are quite sure that the connection is not the issue, since this occurs with multiple sources and the Orin does not have this issue, as can be seen in Picture 3.)
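
For completeness, one way to sanity-check a single multicast feed in isolation (outside DeepStream, so without nvstreammux or the pgie) would be a minimal receive-and-decode pipeline like the sketch below; the address, port and caps are placeholders:

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# Receive one RTP/JPEG multicast stream, decode it on the CPU and report the
# achieved decode rate via fpsdisplaysink's fps-measurements signal.
pipeline = Gst.parse_launch(
    "udpsrc address=239.0.0.1 port=5000 auto-multicast=true "
    'caps="application/x-rtp,media=video,encoding-name=JPEG,clock-rate=90000,payload=26" '
    "! rtpjpegdepay ! jpegdec "
    "! fpsdisplaysink name=fps video-sink=fakesink text-overlay=false signal-fps-measurements=true"
)
pipeline.get_by_name("fps").connect(
    "fps-measurements",
    lambda sink, fps, drop, avg: print(f"fps={fps:.1f} dropped={drop:.1f} avg={avg:.1f}"),
)
pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()
```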

In the next image we zoom in to the area between the orange lines.

Picture 2:
In the first udp-source we see that roughly every 40 ms a "gst_nvvideoconvert_transform()" call is made. Above these blocks, the OS runtime shows a sequence of "ioctl" and "pthread_mutex_lock" calls.
In the second udp-source, the OS runtime only shows sequences of "poll" and "recvmsg" at the times where there is a gap in the incoming frames.
Our first question is: why does this happen, and can it be a result of an overloaded GPU?

On the right of this image we see that 2 batches can be launched in parallel.
This means the PGIE will have to process two batches in sequence. If this happens for multiple batches in a row, can this result in a GPU bottleneck? And can we therefore interpret the "futex" block as a measure of how much margin we have left on the GPU?

Picture 3:
Below is the same view as in Picture 1, but now for the Orin. Here we see that there are no gaps in the udp-sources. The same code is running, configured with the same sources.

Hi,
Please upgrade to Jetpack 5.1.2 and give it a try. If the issue is still present, please share the reproduction steps and we can set up a Xavier developer kit to replicate the phenomenon.

Thanks for looking into this. As we are using industrial Jetson devices manufactured and sold by NVIDIA-partnered suppliers, we are reliant on their custom L4T kernels, of which Jetpack 5.1 is the latest currently available. Which Jetpack 5.1.2 updates specifically would help our throughput? Maybe we can look into backporting something.

There has been no update from you for a while, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

It’s difficult to pinpoint the specific package; you may need to check with your vendor about the latest Jetpack upgrade.
