AGX Xavier thermal limit


• Hardware Platform (Jetson / GPU) AGX Xavier
• DeepStream Version 5.0
• JetPack Version (valid for Jetson only) 4.4

What’s the thermal limit for running apps on AGX Xavier?

I am running four identical pipelines that undistort a video using a DeepStream plugin built on CUDA OpenCV. One test run in “30W ALL” mode with the fan set to 255 looked normal, except that the frame rate was lower than expected. I then switched to “MAXN” mode with the fan still at 255; after a couple of minutes of running, the system crashed with:

[ 512.088367] nvgpu: 17000000.gv11b gk20a_channel_timeout_handler:1570 [ERR] Job on channel 509 timed out
[ 512.089276] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 509
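To line these errors up against other logs, it can help to pull the kernel timestamps out of the console log. A minimal sketch, assuming the log lines look like the two above; the function name `nvgpu_errors` and the regular expression are illustrative, not part of any NVIDIA tool:

```python
import re

# Matches lines such as:
# [ 512.088367] nvgpu: 17000000.gv11b <function>:<line> [ERR] <message>
ERR_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+nvgpu:.*\[ERR\]\s+(.*)")

def nvgpu_errors(lines):
    """Return (seconds_since_boot, message) for each nvgpu [ERR] line."""
    found = []
    for line in lines:
        match = ERR_RE.search(line)
        if match:
            found.append((float(match.group(1)), match.group(2).strip()))
    return found

# The two lines quoted above, used as sample input.
log = [
    "[ 512.088367] nvgpu: 17000000.gv11b gk20a_channel_timeout_handler:1570 [ERR] Job on channel 509 timed out",
    "[ 512.089276] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 509",
]

errors = nvgpu_errors(log)
for stamp, message in errors:
    print(f"{stamp:.3f}s: {message}")
```

The timestamps are seconds since boot, so matching them to tegrastats samples still requires knowing when tegrastats was started.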

I then checked the tegrastats log; at the moment of the above error, the GPU temperature was around 45C:

RAM 13080/31919MB (lfb 4158x4MB) SWAP 0/15959MB (cached 0MB) CPU [99%@2265,38%@2265,36%@2265,38%@2265,43%@2265,47%@2265,36%\
@2265,38%@2265] EMC_FREQ 0% GR3D_FREQ 97% AO@40.5C GPU@45C Tdiode@44.25C PMIC@100C AUX@39C CPU@43C thermal@41.7C Tboard@39C\
 GPU 13624/8571 CPU 4592/3326 SOC 6733/4620 CV 0/0 VDDRQ 2145/1429 SYS5V 3451/3016
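For a long tegrastats log, reading the temperatures by eye gets tedious. Below is a rough sketch of how the `NAME@<value>C` fields could be extracted from one line, assuming the format shown above; the helper name `parse_temps` and the pattern are my own, and the sample string is a shortened copy of the line above:

```python
import re

# Temperature fields look like "GPU@45C" or "Tdiode@44.25C".
# CPU load fields such as "99%@2265" do not match: they have no
# leading sensor name and no trailing "C".
TEMP_RE = re.compile(r"([A-Za-z]\w*)@(\d+(?:\.\d+)?)C\b")

def parse_temps(line):
    """Map each sensor name in a tegrastats line to its reading in Celsius."""
    return {name: float(value) for name, value in TEMP_RE.findall(line)}

sample = (
    "CPU [99%@2265,38%@2265] EMC_FREQ 0% GR3D_FREQ 97% "
    "AO@40.5C GPU@45C Tdiode@44.25C PMIC@100C AUX@39C "
    "CPU@43C thermal@41.7C Tboard@39C"
)

temps = parse_temps(sample)
print(temps["GPU"], temps["CPU"])
```

Running this over every line of the log and tracking the largest `GPU` value gives the peak GPU temperature for the run.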

What is the highest temperature the GPU can run at without crashing? Is 45C the thermal limit of the GPU? If so, what does the AGX Xavier thermal spec of -25C to 80C mean?

Attached are the serial console log after_reflash_4_undistort_crash.log (33.7 KB) and the tegrastats log after_reflash_4_undistort_crash_tegrastats.log (260.7 KB), captured when the nvgpu_set_error_notifier_locked error happened.

I have not used an AGX Xavier, but the Thermal Design Guide for the platform clearly states that the maximum temperature at the TTP (Thermal Transfer Plate) must not exceed 80 deg C, for both “30W ALL” and “MAXN” modes. FWIW, I see no indication in the log that the CPU or GPU hit that, or any other, thermal limit. Did I overlook something?

What you observe is correct: I don’t see any of the thermal readings (GPU, CPU, etc.) hitting the 80C limit. They are actually way below it (the peak I saw is 45C). That’s exactly my question: “Why does the GPU error occur whenever the GPU temperature is > 45C?” (That is my experience so far; it needs further validation.) Another way to ask it: “Can you construct a case where you run multiple apps and push the GPU above 50C without any GPU error occurring?” or “How high can the GPU temperature go without a GPU error occurring?” Does that make sense?

“Why does the GPU error occur whenever the GPU temperature is > 45C?”

This question implies a causal relationship for which there seems to be no support in the available data (the log linked here). Since the recorded temperature is nowhere near the hardware limits (in fact, it is quite low at 45 deg Celsius), it stands to reason that the issue observed has, in all likelihood, nothing to do with the device temperature.

I have not used this platform and cannot interpret the details of the log output. The issue could be due to a bug in the firmware or elsewhere in the software stack, or it could be a hardware issue such as an undersized power supply (generally speaking, a common source of flakiness).

I would suggest asking about the issue in the sub-forum dedicated to the AGX Xavier. There should be a lot more participants there who have hands-on experience with this platform.