elvis7
March 27, 2026, 10:40am
1
Hi NVIDIA team,
I am developing a custom board for the Thor T5000 module. The system freezes (OS hang) specifically when tj-thermal reaches 80°C. The CPU remains alive, since the debug console is still responsive, but the console reports a Data abort exception.
Exception Log:
Exception Type: 0
DFAR: 0xd9684094 DFSR: 0x00001008 ADFSR: 0x00500000
IFAR: 0x00000000 IFSR: 0x00000000 AIFSR: 0x00000000
PC: 0x00026f4e LR: 0x00026f4d SP: 0x0007ded0 PSR: 0x2000003f
R0: 0x00000001 R1: 0x40084738 R2: 0x000000c6 R3: 0xd9680000
R4: 0x00041050 R5: 0x00000005 R6: 0x00004094 R7: 0x0007ded0
R8: 0x00000004 R9: 0x00000003 R10: 0x00000000 R11: 0x41c956c4
R12: 0x408ac701
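A quick sanity check on the dump above: the faulting address (DFAR) equals R3 + R6, i.e. offset 0x4094 from a base of 0xd9680000. The sketch below verifies that and decodes DFSR. It assumes the core that printed the dump uses the ARMv7 DFSR layout (fault status in bits [3:0] plus bit 10, WnR in bit 11, ExT in bit 12). That assumption is inferred from the register names only, not from NVIDIA documentation, and `decode_dfsr` is a hypothetical helper name.

```python
# Register values copied from the exception dump above.
REGS = {"DFAR": 0xD9684094, "DFSR": 0x00001008, "R3": 0xD9680000, "R6": 0x00004094}

# Subset of ARMv7 DFSR fault-status encodings (assumed layout, see lead-in).
FAULT_STATUS = {
    0b00001: "alignment fault",
    0b01000: "synchronous external abort",
    0b10110: "asynchronous external abort",
}


def decode_dfsr(dfsr: int) -> dict:
    """Split a DFSR value into fault status, write flag, and ExT bit."""
    # FS[4] lives in bit 10, FS[3:0] in bits 3..0.
    fs = (((dfsr >> 10) & 1) << 4) | (dfsr & 0xF)
    return {
        "fault": FAULT_STATUS.get(fs, f"other (FS={fs:#07b})"),
        "is_write": bool((dfsr >> 11) & 1),  # WnR: 1 = write, 0 = read
        "ext": (dfsr >> 12) & 1,             # ExT: implementation-defined abort type
    }


if __name__ == "__main__":
    # Faulting address is base register plus offset register.
    print(hex(REGS["R3"] + REGS["R6"]), hex(REGS["DFAR"]))
    print(decode_dfsr(REGS["DFSR"]))
```

Under that assumption, DFSR 0x00001008 decodes as a synchronous external abort on a read, meaning the bus returned an error for the access rather than the core hitting a permission or alignment fault.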
Testing & Observations:
Stress Test: Under high load, once tj-thermal exceeds 80°C, the console shows Exception Type: 0. The system remains partially alive, but any mouse click or keyboard input immediately freezes the OS. If the fan speed is increased to keep Tj below 80°C, the system does not freeze.
Idle Heating Test: Even in an idle state (nothing running in the OS), using a heat gun to bring tj-thermal to 80°C immediately triggers the same exception.
Power Rails: Monitored 5V and 12V input rails on the carrier board; no significant voltage drops during the crash.
Cross-Validation: Using the same module, heatsink, and OS (M.2) on the NVIDIA Official DevKit, the system works perfectly even when the temperature exceeds 80°C. The issue only occurs on our board.
Questions:
What specific hardware or firmware sequence is triggered exactly at tj-thermal = 80°C? (e.g., BPMP frequency scaling, voltage VID change, or specific I2C polling?)
Does the address 0xd9684094 point to a specific internal bus or peripheral register that might be sensitive to signal integrity or ground bounce during thermal throttling?
How can I solve this issue? Any ideas?
thanks.
elvis7
March 27, 2026, 10:42am
2
0327_03.txt (258.4 KB)
0327_01.txt (157.4 KB)
0327_02.txt (212.4 KB)
debug console log
Hi,
I think you can start by checking the tj-thermal configuration in the device tree. It is part of the thermal-zones node. Please check the tj-sw-shutdown entry as well, since the temperature value may have been changed.
tree /proc/device-tree/thermal-zones/tj-thermal/trips/tj-sw-shutdown/
/proc/device-tree/thermal-zones/tj-thermal/trips/tj-sw-shutdown/
|-- hysteresis
|-- name
|-- phandle
|-- temperature
`-- type
Regards,
Manuel Leiva
Embedded SW Engineer at RidgeRun
Contact us: support@ridgerun.com
Developers wiki: https://developer.ridgerun.com
Website: www.ridgerun.com
Thor Thermal docs
ls /sys/class/thermal/thermal_zone0/
available_policies cdev1_trip_point cdev3 k_d offset sustainable_power trip_point_1_hyst trip_point_2_type trip_point_4_temp
cdev0 cdev1_weight cdev3_trip_point k_i policy temp trip_point_1_temp trip_point_3_hyst trip_point_4_type
cdev0_trip_point cdev2 cdev3_weight k_po power trip_point_0_hyst trip_point_1_type trip_point_3_temp type
cdev0_weight cdev2_trip_point emul_temp k_pu slope trip_point_0_temp trip_point_2_hyst trip_point_3_type uevent
cdev1 cdev2_weight integral_cutoff mode subsystem trip_point_0_type trip_point_2_temp trip_point_4_hyst
grep "" /sys/class/thermal/thermal_zone*/trip_point_*
/sys/class/thermal/thermal_zone0/trip_point_0_hyst:0
/sys/class/thermal/thermal_zone0/trip_point_0_temp:114500
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/trip_point_1_hyst:0
/sys/class/thermal/thermal_zone0/trip_point_1_temp:80000
/sys/class/thermal/thermal_zone0/trip_point_1_type:active
/sys/class/thermal/thermal_zone0/trip_point_2_hyst:0
/sys/class/thermal/thermal_zone0/trip_point_2_temp:86000
/sys/class/thermal/thermal_zone0/trip_point_2_type:active
/sys/class/thermal/thermal_zone0/trip_point_3_hyst:0
/sys/class/thermal/thermal_zone0/trip_point_3_temp:91000
/sys/class/thermal/thermal_zone0/trip_point_3_type:active
/sys/class/thermal/thermal_zone0/trip_point_4_hyst:0
/sys/class/thermal/thermal_zone0/trip_point_4_temp:100000
/sys/class/thermal/thermal_zone0/trip_point_4_type:active
/sys/class/thermal/thermal_zone1/trip_point_0_hyst:0
/sys/class/thermal/thermal_zone1/trip_point_0_temp:109000
/sys/class/thermal/thermal_zone1/trip_point_0_type:passive
/sys/class/thermal/thermal_zone1/trip_point_1_hyst:0
/sys/class/thermal/thermal_zone1/trip_point_1_temp:114500
/sys/class/thermal/thermal_zone1/trip_point_1_type:critical
/sys/class/thermal/thermal_zone2/trip_point_0_hyst:0
/sys/class/thermal/thermal_zone2/trip_point_0_temp:109000
/sys/class/thermal/thermal_zone2/trip_point_0_type:passive
/sys/class/thermal/thermal_zone2/trip_point_1_hyst:0
/sys/class/thermal/thermal_zone2/trip_point_1_temp:114500
/sys/class/thermal/thermal_zone2/trip_point_1_type:critical
/sys/class/thermal/thermal_zone3/trip_point_0_hyst:0
/sys/class/thermal/thermal_zone3/trip_point_0_temp:109000
/sys/class/thermal/thermal_zone3/trip_point_0_type:passive
/sys/class/thermal/thermal_zone3/trip_point_1_hyst:0
/sys/class/thermal/thermal_zone3/trip_point_1_temp:114500
/sys/class/thermal/thermal_zone3/trip_point_1_type:critical
/sys/class/thermal/thermal_zone4/trip_point_0_hyst:0
/sys/class/thermal/thermal_zone4/trip_point_0_temp:109000
/sys/class/thermal/thermal_zone4/trip_point_0_type:passive
/sys/class/thermal/thermal_zone4/trip_point_1_hyst:0
/sys/class/thermal/thermal_zone4/trip_point_1_temp:114500
/sys/class/thermal/thermal_zone4/trip_point_1_type:critical
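To see which trip point lines up with the 80°C threshold, the grep output above can be grouped per zone and trip. This is a minimal sketch that assumes the `path:value` line format shown above; `parse_trips` and `trips_at` are hypothetical helper names, and `SAMPLE` holds a few lines copied from the listing.

```python
import re
from collections import defaultdict

# Matches e.g. ".../thermal_zone0/trip_point_1_temp:80000"
LINE_RE = re.compile(r"thermal_zone(\d+)/trip_point_(\d+)_(hyst|temp|type):(.*)$")


def parse_trips(text: str) -> dict:
    """Return {(zone, trip_index): {"hyst": int, "temp": int, "type": str}}."""
    trips = defaultdict(dict)
    for line in text.splitlines():
        m = LINE_RE.search(line.strip())
        if not m:
            continue  # ignore anything that is not a trip_point line
        zone, idx, field, value = m.groups()
        value = value.strip()
        trips[(int(zone), int(idx))][field] = value if field == "type" else int(value)
    return dict(trips)


def trips_at(trips: dict, millideg: int) -> list:
    """List ((zone, trip_index), type) for trips at exactly `millideg` m°C."""
    return sorted((key, t.get("type")) for key, t in trips.items()
                  if t.get("temp") == millideg)


SAMPLE = """\
/sys/class/thermal/thermal_zone0/trip_point_0_temp:114500
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/trip_point_1_hyst:0
/sys/class/thermal/thermal_zone0/trip_point_1_temp:80000
/sys/class/thermal/thermal_zone0/trip_point_1_type:active
/sys/class/thermal/thermal_zone1/trip_point_0_temp:109000
/sys/class/thermal/thermal_zone1/trip_point_0_type:passive
"""

if __name__ == "__main__":
    print(trips_at(parse_trips(SAMPLE), 80000))
```

In the listing above, the only trip at 80000 is trip_point_1 of thermal_zone0, and its type is `active`. In the Linux thermal framework that trip type is associated with active cooling devices such as fans, whereas `passive` trips are the ones tied to throttling.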
Exception Type: 0
DFAR: 0xd9684094 DFSR: 0x00001008 ADFSR: 0x00500000
IFAR: 0x00000000 IFSR: 0x00000000 AIFSR: 0x00000000
PC: 0x00026f4e LR: 0x00026f4d SP: 0x0007ded0 PSR: 0x2000003f
R0: 0x00000001 R1: 0x40084738 R2: 0x000000c6 R3: 0xd9680000
R4: 0x00041050 R5: 0x00000005 R6: 0x00004094 R7: 0x0007ded0
R8: 0x00000004 R9: 0x00000003 R10: 0x00000000 R11: 0x41c956c4
R12: 0x408ac701
This error log is related to a display DCE crash.
I think it is related to the stress test rather than to a thermal problem.
For example, if you put the device in a chamber at around 80°C without running the stress test, I believe you won't hit the above issue.
elvis7
March 31, 2026, 7:10am
7
WayneWWW:
This error log is related to a display DCE crash.
I think it is related to the stress test rather than to a thermal problem.
For example, if you put the device in a chamber at around 80°C without running the stress test, I believe you won't hit the above issue.
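Hi @WayneWWW,
I believe this issue is temperature-related. Please refer to the experiment logs below.
2026_0331_1: I ran the stress test with the fan speed raised so that tj-thermal stayed below 80°C. After about 3 hours of continuous burn-in, the OS never froze.
2026_0331_2: I repeated the 2026_0331_1 experiment, but this time without raising the fan speed, so tj-thermal quickly exceeded 80°C and the log showed the "Exception Type: 02" message. Clicking the mouse in the OS at that point froze the OS immediately, but the debug console remained usable.
2026_0331_3: This time, after booting into the OS, I ran no programs and simply heated the heatsink with a hair dryer. When the temperature approached 80°C, the log showed the "Exception Type: 02" message. Clicking the mouse in the OS at that point froze the OS immediately, but the debug console remained usable.
Stress test: stress --cpu $(nproc) & ./matrixMulCUBLAS; power mode: 120W
I cross-tested with the NVIDIA carrier board: with the same M.2 SSD, module, and thermal module on the NVIDIA carrier board, the "Exception Type: 02" log never appears and the OS does not freeze, even at 90°C or 100°C. Does the module monitor a temperature or some other signal on the carrier board? Or do you have any other ideas?
Also, you mentioned that the error log is related to a display DCE crash. What can I improve here, or in which direction should I debug?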
thanks,
2026_0331_1.txt (6.2 MB)
2026_0331_2.txt (312.0 KB)
2026_0331_3.txt (311.4 KB)
elvis7
March 31, 2026, 7:12am
8
Hi,
Thanks for the suggestion. I checked the tj-sw-shutdown temperature value: it is 0x0001BF44, which corresponds to 114.5°C, and the type is indeed critical.
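The conversion is easy to verify: thermal trip temperatures in the device tree are expressed in millidegrees Celsius, and a property read from /proc/device-tree is a 32-bit big-endian cell. The sketch below is illustrative only, and `millicelsius_from_cell` is a hypothetical helper name.

```python
def millicelsius_from_cell(raw: bytes) -> int:
    """Decode one 32-bit big-endian device-tree cell into an integer."""
    if len(raw) != 4:
        raise ValueError(f"expected a 4-byte cell, got {len(raw)} bytes")
    return int.from_bytes(raw, "big")


if __name__ == "__main__":
    # Value reported in the post. On the target, the same bytes could be read
    # with open(".../tj-thermal/trips/tj-sw-shutdown/temperature", "rb").read()
    cell = bytes.fromhex("0001BF44")
    print(millicelsius_from_cell(cell) / 1000, "degC")
```

The decoded value, 114500 m°C, matches the `critical` trip_point_0_temp of thermal_zone0 in the sysfs listing earlier in the thread.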
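By "OS freeze", do you actually mean something like the GUI becoming unresponsive?
If the debug console still works, the OS itself is still running normally.
Are the display settings on your carrier board the same as on the NV devkit?
Hi elvis,
Could you run an experiment to confirm this? First unload nvidia.ko / nvidia-modeset.ko (the display will stop working),
then run the stress test and check whether the issue no longer occurs.
Hi @elvis7
I would also like to check the settings in your kernel DT and DCB.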
elvis7
April 1, 2026, 6:15am
13
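Hi @WayneWWW,
I ran the following 4 experiments:
1. Booted without a monitor connected, then ran the stress test. With tj > 80 the log did not show Exception Type: 0, but clicking the mouse produced the following error log:
[ 1426.895005] dce: dce_ipc_send_message:469 Error getting next free buf to write
[ 1426.895036] dce: dce_ipc_send_message_sync:546 Error in sending message to DCE
[ 1426.895752] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c97e:6:0:0xffffffe4
After plugging in the monitor, the OS display did not freeze, but when I ran the shutdown command, CARRIER_POWER_ON could not be pulled low and the error log kept appearing. Please see log 2026_0331_4_no_monitor.
2. Booted without a monitor connected, then ran the stress test. The log showed Exception Type: 0. Please see log 2026_0331_6_no_monitor.
3. Following your suggestion, I first unloaded nvidia.ko / nvidia-modeset.ko, using the commands in the attachment to shut down the display driver. Because the GPU stress test ./matrixMulCUBLAS cannot run with the display driver unloaded, I only ran stress --cpu $(nproc) and used a hair dryer to heat tj above 80. The log did not show Exception Type: 0. After the temperature dropped, I restarted the graphical desktop service (GDM) to bring the display back, and the OS did not freeze. Please see log 2026_0331_5_ko.
4. Unloaded nvidia.ko / nvidia-modeset.ko. After the temperature exceeded tj > 80, I restarted the graphical desktop service (GDM) to bring the display back and then ran the GPU stress test; Exception Type: 0 did not appear. However, after I stopped the stress test and the temperature came down, Exception Type: 0 appeared, and clicking the mouse then froze the OS. Please see log 2026_0331_8_ko.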
2026_0331_4_no_monitor.txt (904.1 KB)
2026_0331_6_no_monitor.txt (328.2 KB)
2026_0331_5_ko.txt (428.4 KB)
2026_0331_8_ko.txt (939.5 KB)
2026_0401_command.txt (2.6 KB)
dyna_pcb_dts.zip (47.4 KB)
dcb_hdmi.txt (1.6 KB)
Hi elvis7,
Another question: can your three HDMI outputs be used at the same time?
Was this test done with a single monitor or with multiple monitors?
elvis7
April 1, 2026, 6:28am
15
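Hi @WayneWWW,
I ran this experiment with a single monitor, but the final product will use three monitors at the same time.
If you change the DCB settings to something similar to the NV devkit (only HDMI on DP1), does the issue become impossible to reproduce?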
elvis7
April 1, 2026, 6:48am
17
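I can try that. However, when I move the module, the M.2 OS drive, and the thermal module over to the NV devkit, the issue does not occur. Would my DCB settings be changed on the NV devkit?
The DCB will not be changed. But the display hardware on the devkit is different from that on your board, and that can also have an effect.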
elvis7
April 1, 2026, 9:05am
19
Hi @WayneWWW,
I tried making the DCB the same as the devkit's, and the issue still occurs. Please see the log.
dcb_dp_hdmi.txt (1.5 KB)
2026_0401_1.txt (451.4 KB)
Besides the DCB, have you also configured the HPD pinmux and os_gpio_hotplug in the DT?
elvis7
April 2, 2026, 5:40am
21
Hi @WayneWWW,
Yes, the display-related pinmux settings (HPD/I2C) are all configured the same as on the devkit, but the issue still occurs.