AGX Xavier thermal limit

ynjiun · September 5, 2020, 9:54pm

Please provide complete information as applicable to your setup.

**• Hardware Platform (Jetson / GPU)** AGX Xavier
• DeepStream Version 5.0
• JetPack Version (valid for Jetson only) 4.4

What’s the thermal limit for running apps on the AGX Xavier?

I am running four identical pipelines that undistort a video using a DeepStream plugin built on CUDA OpenCV. In one test run in “30W ALL” mode with the fan set to 255, everything seemed normal except that the frame rate was lower than expected. I then changed to “MAXN” mode, with the fan still set to 255, and after a couple of minutes of running the system crashed with:

[ 512.088367] nvgpu: 17000000.gv11b gk20a_channel_timeout_handler:1570 [ERR] Job on channel 509 timed out
[ 512.089276] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 509
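
For anyone triaging similar logs, here is a minimal Python sketch that pulls the timestamp, reporting function, and channel number out of nvgpu `[ERR]` lines like the two above. The regular expression is inferred from these two sample lines only, so treat it as an assumption rather than a documented kernel log format; `parse_nvgpu_errors` is a hypothetical helper name.

```python
import re

# Pattern inferred from the two sample lines in this post (an assumption,
# not a documented format):
#   "[ <seconds>] nvgpu: <device> <function>:<line> [ERR] <message>"
NVGPU_ERR = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvgpu:\s+(?P<dev>\S+)\s+"
    r"(?P<func>\w+):(?P<line>\d+)\s+\[ERR\]\s+(?P<msg>.*)"
)
# The messages above refer to the channel as either "channel 509" or "ch 509".
CHANNEL = re.compile(r"\bch(?:annel)?\s+(\d+)")


def parse_nvgpu_errors(text):
    """Return one dict per nvgpu [ERR] line found in text."""
    events = []
    for raw in text.splitlines():
        m = NVGPU_ERR.search(raw)
        if not m:
            continue
        ch = CHANNEL.search(m.group("msg"))
        events.append({
            "time_s": float(m.group("ts")),
            "device": m.group("dev"),
            "function": m.group("func"),
            "channel": int(ch.group(1)) if ch else None,
            "message": m.group("msg"),
        })
    return events


SAMPLE = """\
[ 512.088367] nvgpu: 17000000.gv11b gk20a_channel_timeout_handler:1570 [ERR] Job on channel 509 timed out
[ 512.089276] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 509
"""

for event in parse_nvgpu_errors(SAMPLE):
    print(event["time_s"], event["function"], event["channel"])
```

The extracted timestamps (seconds since boot) can then be lined up against tegrastats samples taken around the same time.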

I then checked the tegrastats log. At the moment of the above error, the GPU temperature was around 45C:

RAM 13080/31919MB (lfb 4158x4MB) SWAP 0/15959MB (cached 0MB) CPU [99%@2265,38%@2265,36%@2265,38%@2265,43%@2265,47%@2265,36%@2265,38%@2265] EMC_FREQ 0% GR3D_FREQ 97% AO@40.5C GPU@45C Tdiode@44.25C PMIC@100C AUX@39C CPU@43C thermal@41.7C Tboard@39C GPU 13624/8571 CPU 4592/3326 SOC 6733/4620 CV 0/0 VDDRQ 2145/1429 SYS5V 3451/3016
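
To check readings like these across a whole tegrastats log rather than one line, a small Python sketch along the following lines may help. It assumes temperature fields always look like `<name>@<value>C`, as they do in the sample above; `parse_temps` and `peak_temps` are hypothetical helper names, not part of any NVIDIA tool.

```python
import re

# Matches "<name>@<value>C" tokens such as "GPU@45C" or "Tdiode@44.25C".
# CPU load entries like "99%@2265" are skipped: "%" is not a word
# character, and the value is not followed by "C".
TEMP = re.compile(r"(\w+)@(-?\d+(?:\.\d+)?)C\b")


def parse_temps(line):
    """Return {sensor_name: degrees_C} for one tegrastats line."""
    return {name: float(value) for name, value in TEMP.findall(line)}


def peak_temps(lines):
    """Return the highest reading seen per sensor across many lines."""
    peaks = {}
    for line in lines:
        for name, value in parse_temps(line).items():
            peaks[name] = max(value, peaks.get(name, value))
    return peaks


# The tegrastats sample from this post, joined into a single line.
SAMPLE = (
    "RAM 13080/31919MB (lfb 4158x4MB) SWAP 0/15959MB (cached 0MB) "
    "CPU [99%@2265,38%@2265,36%@2265,38%@2265,43%@2265,47%@2265,36%@2265,38%@2265] "
    "EMC_FREQ 0% GR3D_FREQ 97% AO@40.5C GPU@45C Tdiode@44.25C PMIC@100C "
    "AUX@39C CPU@43C thermal@41.7C Tboard@39C "
    "GPU 13624/8571 CPU 4592/3326 SOC 6733/4620 CV 0/0 VDDRQ 2145/1429 SYS5V 3451/3016"
)

temps = parse_temps(SAMPLE)
print(temps["GPU"], temps["Tdiode"])
```

Passing every line of a saved log to `peak_temps` gives the per-sensor maximum over the run, which can be compared against whatever limit the thermal documentation specifies.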

What is the highest temperature the GPU can run at without crashing? Is 45C the thermal limit of the GPU? If so, what does the AGX Xavier thermal spec of -25C to 80C mean?

Attached are the serial console log, after_reflash_4_undistort_crash.log (33.7 KB), and the tegrastats log, after_reflash_4_undistort_crash_tegrastats.log (260.7 KB), both captured when the nvgpu_set_error_notifier_locked error happened.

njuffa · September 5, 2020, 10:49pm

I have not used an AGX Xavier, but the Thermal Design Guide for the platform clearly states that the maximum temperature at the TTP (Thermal Transfer Plate) must not exceed 80 deg C, for both “30W ALL” and “MAXN” modes. FWIW, I see no indication in the log that CPU or GPU hit that, or any other, thermal limit. Did I overlook something?

ynjiun · September 6, 2020, 12:06am

What you observe is correct. I don’t see any of the thermal readings (GPU, CPU, etc.) hit the 80C limit; they are actually well below it (the peak I saw is 45C). That’s exactly my question: “why does the GPU error occur whenever the GPU temperature is > 45C?” (That has been my experience so far; it needs further validation.) Another way to ask it: “can you show a case where you run multiple apps and push the GPU above 50C without any GPU error occurring?” or “how high can the GPU temperature go without a GPU error occurring?” Does that make sense?

njuffa · September 6, 2020, 2:39am

“why does the GPU error occur whenever the GPU temperature is > 45C?”

This question implies a causal relationship for which there seems to be no motivation in the available data (the log linked here). Since the recorded temperature is not anywhere close to the hardware limits – in fact, quite low at 45 deg Celsius – it stands to reason that the issue observed has, in all likelihood, nothing to do with the device temperature.

I have not used this platform and cannot interpret the details of the log output. The issue could be due to a bug in the firmware or elsewhere in the software stack, or maybe a hardware issue like an insufficiently sized power supply (generally speaking, a common source of flakiness).

I would suggest asking about the issue in the sub-forum dedicated to the AGX Xavier. There should be a lot more participants there who have hands-on experience with this platform.
