Several Jetson Orin 64GB Devkits used in our product power off unexpectedly at ambient temperatures under 25℃

We have integrated the Jetson Orin 64GB Devkit into our product for algorithm execution, housed within the product’s external casing. We’ve encountered an issue where a total of 6 units have experienced unexpected power-offs during testing within the past month, after which they can resume operation upon repowering. The issue is difficult to reproduce; at its most frequent, it occurred 3 times within an hour, while ordinarily, it might happen occasionally over several days. The testing environment is indoors and air-conditioned. Please refer to the screenshot for resource utilization: GPU is at 98%, CPU is around 50-70%, and the power mode is set to maximum performance. We have already ruled out insufficient external power supply capability (our own board can supply up to 164.5A) and connectivity issues. Due to the sporadic nature of the power-offs, relevant logs were not saved previously. The relevant system configuration is shown in the figure.

Before the power loss, the following information was seen in the machine’s kernel log. It appears to be related to thermal issues.


Please advise how to locate the issue. If it is caused by high resource usage leading to overcurrent or over-temperature conditions that trigger hardware protection, what are the safe usage thresholds, and how should we design for safe operating conditions? The syslog before shutdown has been uploaded and can be used for detailed analysis.

*** Please note that this reply is generated by LLM automatically ***

Based on the provided information, I will attempt to help you locate the issue and provide guidance on how to design for safe operating conditions.

Analysis of the issue

The unexpected power-offs of the Jetson AGX Orin 64GB Devkit units during testing suggest that there might be a thermal or power-related issue. The kernel log before the power loss indicates a thermal event, which could be related to high resource usage leading to overcurrent or temperature problems that trigger hardware protection.

Thermal Management

The Jetson AGX Orin series has a thermal management system that monitors the temperature of the module and takes actions to prevent overheating. The system has multiple thermal zones, each with its own temperature sensor. The thermal management system can throttle the CPU and GPU clocks, reduce the power consumption, or even shut down the system if the temperature exceeds a certain threshold.

Thermal Specifications

The thermal specifications for the Jetson AGX Orin series are not explicitly mentioned in the provided context. However, according to the NVIDIA documentation, the Jetson AGX Orin series has a maximum operating temperature of 105°C (221°F).

Power Management

The Jetson AGX Orin series has a power management system that monitors the power consumption of the module and takes actions to prevent overcurrent. The system has multiple power rails, each with its own current limit. If the current exceeds the limit, the system can throttle the CPU and GPU clocks, reduce the power consumption, or even shut down the system.

Safety Usage Thresholds

To design for safe operating conditions, you should ensure that the system operates within the recommended temperature and power consumption ranges. The recommended temperature range for the Jetson AGX Orin series is between 0°C (32°F) and 105°C (221°F). The recommended power consumption range is not explicitly mentioned in the provided context, but the Jetson AGX Orin 64GB module's power budget is configurable from 15W to 60W depending on the selected power mode.

Design for Safe Operating Conditions

To prevent unexpected power-offs and ensure safe operating conditions, you can take the following steps:

  1. Monitor temperature and power consumption: Use the thermal management system and power management system to monitor the temperature and power consumption of the module. You can use tools like tegrastats and jtop, or read the thermal zones under /sys/class/thermal directly.
  2. Throttle CPU and GPU clocks: If the temperature or power consumption exceeds the recommended threshold, throttle the CPU and GPU clocks to reduce the power consumption and prevent overheating.
  3. Use a heat sink or fan: Consider using a heat sink or fan to improve the cooling of the module.
  4. Ensure proper airflow: Ensure that there is proper airflow around the module to prevent overheating.
  5. Use a power supply with sufficient capacity: Ensure that the power supply can provide sufficient power to the module. In this case, the power supply can provide up to 16.45A, which should be sufficient.
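The monitoring step above can be sketched in a few lines of Python. This is only a minimal sketch, assuming the standard Linux sysfs thermal interface (`/sys/class/thermal/thermal_zone*/type` and `temp`, with `temp` in millidegrees Celsius). The function names and any alert limit you pass in are hypothetical and are not NVIDIA-documented thresholds.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_type: temperature_in_celsius} for every readable zone."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            name = (zone / "type").read_text().strip()
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            # Skip zones whose files are missing or whose sensor cannot be read.
            continue
        readings[name] = millideg / 1000.0
    return readings


def zones_at_or_above(readings, limit_c):
    """Return only the zones whose temperature is at or above limit_c."""
    return {name: temp for name, temp in readings.items() if temp >= limit_c}
```

Calling `read_thermal_zones()` once per second and appending the result to a file that is flushed on every write would leave a temperature trace right up to the moment of power loss, which is the data that is missing so far.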

Comparison with other Jetson modules

The Jetson AGX Orin series has similar power management and thermal management features as other Jetson modules, such as the Jetson Xavier NX series and Jetson Nano series. However, the specific thermal and power management configurations may vary depending on the module.

In conclusion, the unexpected power-offs of the Jetson AGX Orin 64GB Devkit units during testing are likely related to thermal or power-related issues. To design for safe operating conditions, you should monitor temperature and power consumption, throttle CPU and GPU clocks, use a heat sink or fan, ensure proper airflow, and use a power supply with sufficient capacity.


May I know which JetPack SW you’re using?

JetPack 6.2

Please monitor the UART console, reproduce the issue, and share the full UART log together with the tegrastats output.
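For anyone post-processing the tegrastats capture, here is a minimal parsing sketch. It assumes the line layout commonly seen on Orin, where temperatures appear as `name@47.5C` and power rails as `RAIL 4006mW/3900mW` (current/average). The exact field names vary by JetPack release, so the regular expressions and the sample values in the usage note are assumptions rather than a documented format.

```python
import re

# Assumed field shapes: "cpu@47.5C" for temperatures,
# "VDD_GPU_SOC 4006mW/3900mW" (current/average) for power rails.
TEMP_RE = re.compile(r"(\w+)@(-?\d+(?:\.\d+)?)C")
POWER_RE = re.compile(r"(\w+) (\d+)mW/(\d+)mW")


def parse_tegrastats_line(line):
    """Return (temps_in_celsius, current_power_in_mW) parsed from one line."""
    temps = {name: float(value) for name, value in TEMP_RE.findall(line)}
    power = {rail: int(cur) for rail, cur, _avg in POWER_RE.findall(line)}
    return temps, power
```

Running this over every line of the log and looking at the last few samples before the log stops shows whether `tj` or a power rail was climbing right before the power-off.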

Once it powers off, there are no logs.

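That is exactly why we asked you to use the UART log to capture the information from before the shutdown…
Even after the system stops, we can still see what happened in the moment before it powered off.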

SYSLOG.txt (798.7 KB)
This is the syslog we recorded before the shutdown. Does it help?

It does not help. That is why we asked you to follow our instructions…

If you really don't know what information we need, I can restate and explain it again in Chinese.

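OK. The main difficulty is that the issue reproduces only intermittently, not every time, so it is hard to capture. We will try to capture it.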

Hello, we see 4 serial ports. Which one should we use, and what baud rate should we configure?

please follow

We can't open this link. Could you check what is going on?

maybe check if you need a VPN to open it.

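Once the whole unit is assembled, it is inconvenient to open the case and plug in a micro USB cable. Is there a way to save the serial log to a file that can be read after the next boot? That would help us a lot.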

The UART log is stored on your host PC, not on the Jetson.
The Jetson itself cannot save it when the power goes off.
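On the host side, the capture can be as simple as copying the serial stream to a file with a timestamp on every line. Below is a minimal sketch, assuming the console is already exposed on the host as a byte stream; the function name is hypothetical.

```python
import time


def log_with_timestamps(src, dst, clock=time.time):
    """Copy lines from a binary stream to a text stream, one timestamp per line.

    src: iterable of bytes lines (for example a serial device opened in "rb")
    dst: writable text stream (for example a log file opened in "a")
    Returns the number of lines written.
    """
    count = 0
    for raw in src:
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        dst.write(f"[{stamp}] {text}\n")
        dst.flush()  # keep the file current so the last lines before power-off survive
        count += 1
    return count
```

A typical use would be to open the host's serial device in binary mode as `src` and a log file in append mode as `dst`. The device name (often something like `/dev/ttyACM0`) and the port settings (115200 8N1 is the usual Jetson debug console configuration) are assumptions here and need to be set on the host before reading.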

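The cabling on the assembled unit is hard to connect. Is there any other way, for example storing the serial output directly on the Jetson?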

No, there is no other method.


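Hello, during use we observed an OC warning in the JTOP interface, but it did not trigger a shutdown; the resource usage is shown in the figure. We would like to ask how OC should be understood and what its consequences are. My understanding is that OC is reported when the system is overloaded and the total instantaneous current measured by the two INA3221 current sensors (the first samples CPU, GPU, and total 5V; the second samples DDR, IO, and VAO) exceeds this devkit's maximum instantaneous power (TDP, 65W) limit. It causes hardware throttling that lowers the clock frequencies, but it does not cause a shutdown. Is it possible that the power exceeded the limit so severely that the software fan control, software throttling, hardware throttling, and software shutdown policies had no time to respond, leading directly to a hardware shutdown?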

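Hi,

The data we asked you for earlier is exactly what we need to confirm the temperature and power consumption at the time the issue occurs.
Without that data, frankly, there is no way to make a judgment.

Your understanding of OC throttling is correct: it only causes throttling and does not cause a shutdown. A shutdown is more likely to be caused by overheating.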