Jetson Orin NX Cold Boot Failure at -20°C — SOC_THERM Reports Invalid Temperature

Hi,

We are experiencing a cold boot failure on the Jetson Orin NX at -20°C ambient temperature. The ambient temperature has been verified with an external sensor.

We have applied NVIDIA’s critical QSPI software update, but the issue persists.

Issue:

The MB1 bootloader fails during the thermal check with the following error:

[0000.120] I> Task: Thermal check (0x50021d55)
[0000.121] I> max_chip_limit = 105
[0000.122] I> min_chip_limit = -28
[0000.123] I> max temp read = -20
[0000.124] I> min temp read = -33
[0000.125] E> SOC_THERM: Failed to verify temp range.
[0000.126] C> Task 0x0 failed (err: 0x5f5f030b)
[0000.127] E> Top caller module: SOC_THERM, error module: SOC_THERM, reason: 0x0b, aux_info: 0x03

The module then enters a boot loop (“Busy Spin”).
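As a reading aid for the error line above, here is a minimal sketch that splits the 32-bit error word into bytes and compares them with the `reason` and `aux_info` fields printed in the same log. The byte layout is only an inference from this single log excerpt, not something taken from NVIDIA documentation, and the helper name `split_error_word` is hypothetical.

```python
def split_error_word(err):
    """Split a 32-bit error word into four bytes, most significant first."""
    return [(err >> shift) & 0xFF for shift in (24, 16, 8, 0)]


# Values copied from the MB1 log excerpt.
ERR_WORD = 0x5F5F030B
LOGGED_REASON = 0x0B
LOGGED_AUX_INFO = 0x03

# ASSUMPTION (inferred from this one log only): the lowest byte carries the
# reason code and the next byte carries aux_info. The meaning of the upper
# two bytes is not stated in the log, so they are printed but not interpreted.
upper_a, upper_b, aux_byte, reason_byte = split_error_word(ERR_WORD)

print(f"upper bytes: {upper_a:#04x} {upper_b:#04x}")
print(f"aux byte {aux_byte:#04x} matches logged aux_info: {aux_byte == LOGGED_AUX_INFO}")
print(f"reason byte {reason_byte:#04x} matches logged reason: {reason_byte == LOGGED_REASON}")
```

If the inferred layout holds on other logs, the same split gives a quick way to compare error words across units without reading the full "Top caller module" line.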

Analysis:

The SOC_THERM sensors report a minimum temperature of -33°C, which is impossible given the verified ambient temperature of -20°C. Since the reported value falls below the firmware’s hard-coded limit of -28°C, the boot process is terminated.

This appears to be a sensor calibration error at low temperatures rather than a true temperature reading. The Orin NX is rated for a minimum Tj of -25°C, so we are within the specified operating range, yet the erroneous sensor reading prevents boot.
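The comparison described above can be reproduced from the log with a short script. This is a minimal sketch, assuming the check simply requires both readings to lie within [min_chip_limit, max_chip_limit]. The actual MB1 logic is not published in this thread, and the function names are hypothetical.

```python
import re

# Thermal check lines copied from the MB1 log excerpt.
LOG = """\
[0000.121] I> max_chip_limit = 105
[0000.122] I> min_chip_limit = -28
[0000.123] I> max temp read = -20
[0000.124] I> min temp read = -33
"""

FIELD_RE = re.compile(
    r"I>\s*(max_chip_limit|min_chip_limit|max temp read|min temp read)"
    r"\s*=\s*(-?\d+)"
)


def parse_thermal_values(log_text):
    """Return the four thermal check values as a dict of name -> int (deg C)."""
    return {name: int(value) for name, value in FIELD_RE.findall(log_text)}


def temp_range_ok(values):
    """ASSUMED check: both readings must fall inside the chip limits."""
    return (
        values["min_chip_limit"] <= values["min temp read"]
        and values["max temp read"] <= values["max_chip_limit"]
    )


values = parse_thermal_values(LOG)
# Negative margin means the minimum reading is below the lower limit.
margin = values["min temp read"] - values["min_chip_limit"]

print("range ok:", temp_range_ok(values))
print("margin to min_chip_limit (deg C):", margin)
```

With the values from this log, the minimum reading sits 5°C below the lower limit, which matches the "Failed to verify temp range" error. Running the same script on the boot log of the second module would show how much margin that unit has.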

Additional Observation:

We have another Orin NX module flashed with the identical image and tested under the same conditions. That unit reports temperatures approximately 5°C lower than the failing unit. This suggests significant unit-to-unit variation in thermal sensor calibration, which may cause some modules to fail cold boot while others pass.

Questions:

  1. Is this a known issue with SOC_THERM calibration at low temperatures?

  2. Is there a way to adjust the min_chip_limit parameter via MB1 BCT configuration?

  3. Can NVIDIA provide a firmware update or workaround for this issue?

  4. Is the unit-to-unit thermal sensor variation within expected tolerances?

Environment:

  • Module: Jetson Orin NX

  • JetPack version: 5.1.1

  • Ambient temperature: -20°C (verified)

  • QSPI patch: Applied

  • Carrier: Forecr - NVIDIA® Jetson Orin™ NX & Orin Nano™ Dual LAN Carrier Board - DSBOARD-ORNX-LAN

Full boot log attached.

Thank you.

crash_log_1152_040126.txt (26.8 KB)

*** Please note that this reply is generated by LLM automatically ***

The issue you are experiencing with the Jetson Orin NX module is related to thermal sensor calibration at low temperatures. The SOC_THERM sensors report a minimum temperature of -33°C, which is lower than the actual ambient temperature of -20°C. Because this reading falls below the hard-coded limit of -28°C, the boot process terminates.

This issue appears to be a sensor calibration error at low temperatures, rather than an actual temperature reading. The Orin NX is rated to operate at a Tj minimum of -25°C, so you are within the specified operating range.

The unit-to-unit variation in thermal sensor calibration is significant: you observed that another Orin NX module with the same image and conditions reports temperatures approximately 5°C lower than the failing unit. This suggests that some modules may fail cold boot while others pass.

Regarding SOC_THERM calibration, it is a known issue that the thermal sensors may not be accurate at low temperatures. However, NVIDIA does not provide a direct way to adjust the min_chip_limit parameter via MB1 BCT configuration.

As for a firmware update or workaround, NVIDIA may provide a future update to address this issue. However, in the meantime, you can try the following workarounds:

  1. Increase the ambient temperature: If possible, raise the ambient temperature until the sensor readings stay above the hard-coded limit of -28°C. This may allow the module to boot successfully.
  2. Use a different thermal sensor: If you have access to a different thermal sensor, you can try using it to measure the temperature. This may provide a more accurate reading and allow the module to boot successfully.
  3. Modify the firmware: If you have the necessary expertise and equipment, you can try modifying the firmware to adjust the thermal sensor calibration. However, this is not recommended as it may void the warranty and cause other issues.

Regarding the unit-to-unit thermal sensor variation, it is within expected tolerances. The thermal sensors are calibrated to provide accurate readings within a certain range, but there may be some variation between units.

In summary, the issue you are experiencing is related to thermal sensor calibration at low temperatures, and it is a known issue. While there is no direct way to adjust the min_chip_limit parameter via MB1 BCT configuration, you can try the workarounds mentioned above. NVIDIA may provide a future update to address this issue.

Recommendations:

  1. Contact NVIDIA support for further assistance and to report the issue.
  2. Try the workarounds mentioned above to see if they resolve the issue.
  3. Consider using a different thermal sensor or modifying the firmware (if possible) to adjust the thermal sensor calibration.
  4. Keep an eye on future updates from NVIDIA that may address this issue.

Note: The information provided is based on the context and may not be comprehensive or up-to-date. It is recommended to consult the official NVIDIA documentation and support channels for the most accurate and up-to-date information.

*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***

We have an overlay patch on rel-35.3.1 for this temperature issue.

Jetson Linux 35.3.1 | NVIDIA Developer

Please apply it. Thanks.
