Thor (T5000) Data Abort (Exception Type 0) triggered at Tj > 80°C on Custom Carrier Board

Hi NVIDIA team,

I am developing a custom carrier board for the Thor T5000 module. The system freezes (OS hang) specifically when tj-thermal reaches 80°C. The CPU remains alive, since the debug console is still responsive, but the console reports a Data Abort exception.

Exception Log:

Exception Type: 0
DFAR: 0xd9684094 DFSR: 0x00001008 ADFSR: 0x00500000
IFAR: 0x00000000 IFSR: 0x00000000 AIFSR: 0x00000000
PC: 0x00026f4e LR: 0x00026f4d SP: 0x0007ded0 PSR: 0x2000003f
R0: 0x00000001 R1: 0x40084738 R2: 0x000000c6 R3: 0xd9680000
R4: 0x00041050 R5: 0x00000005 R6: 0x00004094 R7: 0x0007ded0
R8: 0x00000004 R9: 0x00000003 R10: 0x00000000 R11: 0x41c956c4
R12: 0x408ac701
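
As a rough reading aid for the dump above, the sketch below splits the DFSR into fields and checks the DFAR against the general-purpose registers. It assumes the faulting core uses the ARMv7 short-descriptor DFSR layout (status in bits [3:0] plus bit 10, write-not-read in bit 11, external-abort type in bit 12). That layout is an assumption not confirmed anywhere in this thread, and `decode_dfsr` is a hypothetical helper, not an NVIDIA tool.

```python
# Assumed ARMv7 short-descriptor DFSR layout; not confirmed for this core.
FAULT_STATUS = {
    0b00001: "alignment fault",
    0b01000: "synchronous external abort",
    0b10110: "asynchronous external abort",
}

def decode_dfsr(dfsr: int) -> dict:
    """Split a DFSR value into status, access direction, and ExT bit."""
    # Fault status is a 5-bit code: bit 10 is the top bit, bits [3:0] the rest.
    fs = ((dfsr >> 10) & 1) << 4 | (dfsr & 0xF)
    return {
        "status": FAULT_STATUS.get(fs, f"unknown (0b{fs:05b})"),
        "access": "write" if (dfsr >> 11) & 1 else "read",
        "ext_bit": (dfsr >> 12) & 1,
    }

# Values copied from the exception log above.
regs = {"DFAR": 0xD9684094, "DFSR": 0x00001008, "R3": 0xD9680000, "R6": 0x00004094}

info = decode_dfsr(regs["DFSR"])
dfar_is_r3_plus_r6 = regs["R3"] + regs["R6"] == regs["DFAR"]
print(info, dfar_is_r3_plus_r6)
```

Under these assumptions the dump reads as a synchronous external abort on a read, and the faulting address equals R3 + R6 (0xd9680000 + 0x4094), which would be consistent with a base-plus-offset register access rather than a random pointer.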

Testing & Observations:

  1. Stress Test: Under high load, once tj-thermal exceeds 80°C, the console shows Exception Type: 0. The system remains partially alive, but any mouse click or keyboard input immediately freezes the OS. If the fan speed is increased to keep Tj below 80°C, the system does not freeze.
  2. Idle Heating Test: Even in an idle state (nothing running in the OS), using a heat gun to bring tj-thermal to 80°C immediately triggers the same exception.
  3. Power Rails: I monitored the 5V and 12V input rails on the carrier board and saw no significant voltage drops during the crash.
  4. Cross-Validation: With the same module, heatsink, and OS (M.2) on the official NVIDIA DevKit, the system works perfectly even when the temperature exceeds 80°C. The issue only occurs on our board.

Questions:

  1. What specific hardware or firmware sequence is triggered exactly at tj-thermal = 80°C? (e.g., BPMP frequency scaling, voltage VID change, or specific I2C polling?)
  2. Does the address 0xd9684094 point to a specific internal bus or peripheral register that might be sensitive to signal integrity or ground bounce during thermal throttling?
  3. How can I solve this issue? Any ideas?

thanks.

0327_03.txt (258.4 KB)

0327_01.txt (157.4 KB)

0327_02.txt (212.4 KB)
debug console log

Hi,

I think you can start by checking the tj-thermal configuration in the device tree. It is part of the thermal-zones node. Please check the tj-sw-shutdown entry as well, since the temperature value may have been changed.

tree /proc/device-tree/thermal-zones/tj-thermal/trips/tj-sw-shutdown/
/proc/device-tree/thermal-zones/tj-thermal/trips/tj-sw-shutdown/
|-- hysteresis
|-- name
|-- phandle
|-- temperature
`-- type

Regards,
Manuel Leiva
Embedded SW Engineer at RidgeRun
Contact us: support@ridgerun.com
Developers wiki: https://developer.ridgerun.com
Website: www.ridgerun.com

Thor Thermal docs

ls /sys/class/thermal/thermal_zone0/
available_policies  cdev1_trip_point  cdev3             k_d   offset     sustainable_power  trip_point_1_hyst  trip_point_2_type  trip_point_4_temp
cdev0               cdev1_weight      cdev3_trip_point  k_i   policy     temp               trip_point_1_temp  trip_point_3_hyst  trip_point_4_type
cdev0_trip_point    cdev2             cdev3_weight      k_po  power      trip_point_0_hyst  trip_point_1_type  trip_point_3_temp  type
cdev0_weight        cdev2_trip_point  emul_temp         k_pu  slope      trip_point_0_temp  trip_point_2_hyst  trip_point_3_type  uevent
cdev1               cdev2_weight      integral_cutoff   mode  subsystem  trip_point_0_type  trip_point_2_temp  trip_point_4_hyst
grep "" /sys/class/thermal/thermal_zone*/trip_point_*
/sys/class/thermal/thermal_zone0/trip_point_0_hyst:0
/sys/class/thermal/thermal_zone0/trip_point_0_temp:114500
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/trip_point_1_hyst:0
/sys/class/thermal/thermal_zone0/trip_point_1_temp:80000
/sys/class/thermal/thermal_zone0/trip_point_1_type:active
/sys/class/thermal/thermal_zone0/trip_point_2_hyst:0
/sys/class/thermal/thermal_zone0/trip_point_2_temp:86000
/sys/class/thermal/thermal_zone0/trip_point_2_type:active
/sys/class/thermal/thermal_zone0/trip_point_3_hyst:0
/sys/class/thermal/thermal_zone0/trip_point_3_temp:91000
/sys/class/thermal/thermal_zone0/trip_point_3_type:active
/sys/class/thermal/thermal_zone0/trip_point_4_hyst:0
/sys/class/thermal/thermal_zone0/trip_point_4_temp:100000
/sys/class/thermal/thermal_zone0/trip_point_4_type:active
/sys/class/thermal/thermal_zone1/trip_point_0_hyst:0
/sys/class/thermal/thermal_zone1/trip_point_0_temp:109000
/sys/class/thermal/thermal_zone1/trip_point_0_type:passive
/sys/class/thermal/thermal_zone1/trip_point_1_hyst:0
/sys/class/thermal/thermal_zone1/trip_point_1_temp:114500
/sys/class/thermal/thermal_zone1/trip_point_1_type:critical
/sys/class/thermal/thermal_zone2/trip_point_0_hyst:0
/sys/class/thermal/thermal_zone2/trip_point_0_temp:109000
/sys/class/thermal/thermal_zone2/trip_point_0_type:passive
/sys/class/thermal/thermal_zone2/trip_point_1_hyst:0
/sys/class/thermal/thermal_zone2/trip_point_1_temp:114500
/sys/class/thermal/thermal_zone2/trip_point_1_type:critical
/sys/class/thermal/thermal_zone3/trip_point_0_hyst:0
/sys/class/thermal/thermal_zone3/trip_point_0_temp:109000
/sys/class/thermal/thermal_zone3/trip_point_0_type:passive
/sys/class/thermal/thermal_zone3/trip_point_1_hyst:0
/sys/class/thermal/thermal_zone3/trip_point_1_temp:114500
/sys/class/thermal/thermal_zone3/trip_point_1_type:critical
/sys/class/thermal/thermal_zone4/trip_point_0_hyst:0
/sys/class/thermal/thermal_zone4/trip_point_0_temp:109000
/sys/class/thermal/thermal_zone4/trip_point_0_type:passive
/sys/class/thermal/thermal_zone4/trip_point_1_hyst:0
/sys/class/thermal/thermal_zone4/trip_point_1_temp:114500
/sys/class/thermal/thermal_zone4/trip_point_1_type:critical
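
To see at a glance which trip points sit at the 80°C threshold, the `grep ""` output above can be grouped by zone and trip index. This is a minimal sketch; `parse_trip_points` and `trips_at` are hypothetical helper names, and the sample lines are copied from the dump above.

```python
from collections import defaultdict

def parse_trip_points(lines):
    """Group `path:value` lines from the grep output by zone and trip index."""
    zones = defaultdict(dict)
    for line in lines:
        path, _, value = line.strip().rpartition(":")
        parts = path.split("/")
        if len(parts) < 2 or not parts[-1].startswith("trip_point_"):
            continue
        zone = parts[-2]
        # File names look like trip_point_<index>_<field>.
        _, _, idx, field = parts[-1].split("_", 3)
        zones[zone].setdefault(int(idx), {})[field] = value
    return zones

def trips_at(zones, temp_millic):
    """Return (zone, index, type) for every trip at exactly temp_millic."""
    return [
        (zone, idx, trip.get("type"))
        for zone, trips in sorted(zones.items())
        for idx, trip in sorted(trips.items())
        if int(trip.get("temp", -1)) == temp_millic
    ]

SAMPLE = [
    "/sys/class/thermal/thermal_zone0/trip_point_0_temp:114500",
    "/sys/class/thermal/thermal_zone0/trip_point_0_type:critical",
    "/sys/class/thermal/thermal_zone0/trip_point_1_temp:80000",
    "/sys/class/thermal/thermal_zone0/trip_point_1_type:active",
    "/sys/class/thermal/thermal_zone1/trip_point_0_temp:109000",
    "/sys/class/thermal/thermal_zone1/trip_point_0_type:passive",
]

matches = trips_at(parse_trip_points(SAMPLE), 80000)
print(matches)
```

In the dump above, the only trip at 80000 is `trip_point_1` of `thermal_zone0`, and its type is `active`. In the Linux thermal framework, `active` trips are associated with active cooling devices such as fans.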

Exception Type: 0
DFAR: 0xd9684094 DFSR: 0x00001008 ADFSR: 0x00500000
IFAR: 0x00000000 IFSR: 0x00000000 AIFSR: 0x00000000
PC: 0x00026f4e LR: 0x00026f4d SP: 0x0007ded0 PSR: 0x2000003f
R0: 0x00000001 R1: 0x40084738 R2: 0x000000c6 R3: 0xd9680000
R4: 0x00041050 R5: 0x00000005 R6: 0x00004094 R7: 0x0007ded0
R8: 0x00000004 R9: 0x00000003 R10: 0x00000000 R11: 0x41c956c4
R12: 0x408ac701

This error log is related to a display DCE crash.

I think it is more related to the stress test than to a thermal problem.

For example, if you put the device in a chamber at close to 80°C without running the stress test, I believe you will not hit the above issue.

Hi @WayneWWW ,
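I believe this issue is temperature-related. Please refer to the following experiment logs.

  1. 2026_0331_1: I ran the stress test with the fan speed raised so that tj-thermal stayed below 80°C. After about 3 hours of continuous burn-in, the OS never froze.
  2. 2026_0331_2: I repeated the 2026_0331_1 experiment, but this time without raising the fan speed, so tj-thermal quickly exceeded 80°C and the log showed the Exception Type: 02 message. At that point, clicking the mouse in the OS immediately froze the OS, but the debug console remained usable.
  3. 2026_0331_3: This time, after booting into the OS, I did not run any program and simply heated the heatsink with a hair dryer. When the temperature approached 80°C, the log showed the Exception Type: 02 message. At that point, clicking the mouse in the OS immediately froze the OS, but the debug console remained usable.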

Stress test: stress --cpu $(nproc) & ./matrixMulCUBLAS; power mode: 120W

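I cross-tested with NVIDIA's carrier board. With the same M.2 SSD, module, and thermal module installed on NVIDIA's carrier board, the Exception Type: 02 log never appears and the OS does not freeze, even when the temperature reaches 90 or 100°C. Does the module monitor the temperature or any other signal on the carrier board? Or do you have any other ideas?

Also, you mentioned that the error log is related to a display DCE crash. What can I improve on that front, or in which direction should I debug?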

thanks,

2026_0331_1.txt (6.2 MB)

2026_0331_2.txt (312.0 KB)

2026_0331_3.txt (311.4 KB)

Hi,

Thanks for the suggestion. I checked the tj-sw-shutdown value and it is 0x0001BF44, which corresponds to 114.5°C. The type is indeed critical.
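
The conversion above can be checked with a short sketch. Device-tree integer properties are stored as big-endian 32-bit cells, and thermal trip temperatures are expressed in millidegrees Celsius. `dt_temperature_to_celsius` is a hypothetical helper name used only for illustration.

```python
def dt_temperature_to_celsius(raw: bytes) -> float:
    """Convert one big-endian 32-bit device-tree cell (millidegrees C) to degrees C."""
    if len(raw) != 4:
        raise ValueError("expected a single 32-bit cell")
    return int.from_bytes(raw, "big", signed=True) / 1000

# 0x0001BF44 is the tj-sw-shutdown value reported above.
shutdown_c = dt_temperature_to_celsius(bytes.fromhex("0001BF44"))
print(shutdown_c)  # 114.5
```

On the target, the same bytes could be read directly from the `temperature` file under the `tj-sw-shutdown` node in `/proc/device-tree` (opened in binary mode) and passed to the helper.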

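By "OS freeze", do you actually mean that the GUI becomes unresponsive?

If the debug console still works, the OS itself is actually still running normally.

Is the display configuration on your carrier board the same as on the NV devkit?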

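Hi @WayneWWW

  1. When the issue occurs, I can still move the mouse pointer on the Ubuntu desktop and the numbers in the power GUI keep updating, but as soon as I click the left or right mouse button the screen stops updating, and the mouse, keyboard, and display all become unresponsive.
  2. The display configuration on our board differs from the NV devkit: DP0: HDMI, DP1: HDMI, DP2: HDMI. However, during the experiments I only connected DP1, which matches the NV devkit.
  3. One question: the display configuration is flashed onto the module, correct? If so, swapping the carrier board should not affect it.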

Hi elvis,

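Could you run an experiment to confirm this? First unload nvidia.ko / nvidia-modeset.ko (the display will stop working),
then run the stress test and check whether the issue no longer occurs.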

Hi @elvis7

I would also like to check the settings in your kernel DT and DCB.

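Hi @WayneWWW,

I ran the following 4 experiments:

  1. Booted without a monitor connected, then ran the stress test. At tj > 80 the log did not show Exception Type: 0, but clicking the mouse produced the following error log:
    [ 1426.895005] dce: dce_ipc_send_message:469 Error getting next free buf to write
    [ 1426.895036] dce: dce_ipc_send_message_sync:546 Error in sending message to DCE
    [ 1426.895752] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c97e:6:0:0xffffffe4
    After I plugged in the monitor, the OS display did not freeze, but when I ran the shutdown command, CARRIER_POWER_ON could not be pulled low and the error log kept repeating. See log 2026_0331_4_no_monitor.
  2. Booted without a monitor connected, then ran the stress test. The log showed Exception Type: 0. See log 2026_0331_6_no_monitor.
  3. Following your suggestion, I first unloaded nvidia.ko / nvidia-modeset.ko, using the commands in the attachment to shut down the display driver.
    Because the GPU stress test ./matrixMulCUBLAS cannot run with the display driver unloaded, I only ran stress --cpu $(nproc) and used a hair dryer to heat the module to tj > 80. The log did not show Exception Type: 0. After the temperature dropped, I restarted the graphical desktop service (GDM) to bring the display back, and the OS did not freeze. See log 2026_0331_5_ko.
  4. With nvidia.ko / nvidia-modeset.ko unloaded, I waited until the temperature exceeded tj > 80, then restarted the graphical desktop service (GDM) to bring the display back and ran the GPU stress test. Exception Type: 0 did not appear. However, after I stopped the stress test and the temperature dropped, Exception Type: 0 appeared, and clicking the mouse then froze the OS. See log 2026_0331_8_ko.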
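
To compare the four runs above, it may help to pull the relevant markers out of each debug console log in order. The sketch below scans log lines for the three message types quoted in this thread. The marker labels and the `scan_log` helper are hypothetical, and the timestamp pattern assumes the usual `[ seconds.micros]` kernel prefix.

```python
import re

# Labels are illustrative; the patterns match messages quoted in this thread.
MARKERS = {
    "exception_dump": re.compile(r"Exception Type:\s*\d+"),
    "dce_ipc_error": re.compile(r"dce: dce_ipc_send_message"),
    "modeset_error": re.compile(r"nvidia-modeset: ERROR"),
}
TIMESTAMP = re.compile(r"^\s*\[\s*(\d+\.\d+)\]")

def scan_log(lines):
    """Return (line number, kernel timestamp or None, marker label) per hit."""
    events = []
    for lineno, line in enumerate(lines, start=1):
        ts = TIMESTAMP.match(line)
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                events.append((lineno, float(ts.group(1)) if ts else None, name))
                break
    return events

SAMPLE = [
    "Exception Type: 0",
    "[ 1426.895005] dce: dce_ipc_send_message:469 Error getting next free buf to write",
    "[ 1426.895752] nvidia-modeset: ERROR: GPU:0: Failed to query display engine "
    "channel state: 0x0000c97e:6:0:0xffffffe4",
]

events = scan_log(SAMPLE)
for event in events:
    print(event)
```

Running the same scan over each attached log (for example by passing an open file object as `lines`) would show whether the exception dump always precedes the first `dce:` error, which is the ordering question these experiments are probing.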

2026_0331_4_no_monitor.txt (904.1 KB)

2026_0331_6_no_monitor.txt (328.2 KB)

2026_0331_5_ko.txt (428.4 KB)

2026_0331_8_ko.txt (939.5 KB)

2026_0401_command.txt (2.6 KB)

dyna_pcb_dts.zip (47.4 KB)

dcb_hdmi.txt (1.6 KB)

Hi elvis7,

One more question: can your three HDMI outputs be used simultaneously?
Was this test run with a single monitor or with multiple monitors?

Hi @WayneWWW,

I ran this experiment with a single monitor, but the final product will use three monitors at the same time.

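If you change the DCB settings to resemble the NV devkit (HDMI on DP1 only), does the issue then fail to reproduce?

I can try that. However, when I move the module, M.2 OS drive, and thermal module together onto the NV devkit, the issue does not occur. Would my DCB settings be changed on the NV devkit?

The DCB will not be changed. However, the display hardware on the devkit differs from your board, and that can also have an effect.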

Hi @WayneWWW,

I tried making the DCB the same as the devkit, and the issue still occurs. Please see the log.

dcb_dp_hdmi.txt (1.5 KB)

2026_0401_1.txt (451.4 KB)

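Besides the DCB, have you also configured the hpd pinmux and os_gpio_hotplug in the DT?

Hi @WayneWWW,

Yes, the display-related pinmux (hpd / i2c) is all configured the same as the devkit, but the issue still occurs.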