A serious GPU error occurred on Orin NX with JP5.1

Our customer is using a custom carrier board in production. The device suddenly shut down after running for about 3 hours. The following is the serial port log. This is quite urgent, so please help analyze it.
[11575.119284] nvgpu: 17000000.ga10b ga10b_intr_log_pending_intrs:306 [ERR] Pending TOP[0]: 0x00000004, LEAF[4]: 0x11000000
[11575.132318] arm-smmu 8000000.iommu: disabling translation
[11575.137952] arm-smmu 10000000.iommu: disabling translation
[11575.143605] arm-smmu 12000000.iommu: disabling translation
[11575.179064] CPU1: shutdown
[11575.202913] CPU2: shutdown
[11575.222903] CPU3: shutdown
[11575.270829] CPU4: shutdown
[11575.306822] CPU5: shutdown
[11575.354792] CPU6: shutdown
[11575.375430] IRQ 120: no longer affine to CPU7
[11575.380319] CPU7: shutdown
[11575.383939] reboot: Power down
Shutdown state requested 0
Shutting down system …
kern.log (220.5 KB)

There is nothing useful in your kern.log, and your UART log looks just like a regular shutdown log.

I don’t see anything to check here.

[11575.119284] nvgpu: 17000000.ga10b ga10b_intr_log_pending_intrs:306 [ERR] Pending TOP[0]: 0x00000004, LEAF[4]: 0x11000000
Isn’t the above log an error?

Could the kernel log below be related?
May 8 14:00:02 localhost kernel: [11569.526000] NVRM gpumgrGetSomeGpu: Failed to retrieve pGpu - Too early call!.
May 8 14:00:02 localhost kernel: [11569.526004] NVRM nvAssertFailedNoLog: Assertion failed: NV_FALSE @ gpu_mgr.c:296
May 8 14:00:02 localhost kernel: [11569.526012] CPU: 1 PID: 2072 Comm: Xorg Tainted: G OE 5.10.104-tegra #1
May 8 14:00:02 localhost kernel: [11569.526013] Hardware name: Unknown NVIDIA Orin NX Developer Kit/NVIDIA Orin NX Developer Kit, BIOS r35.3.1-5e812e4-dirty 11/05/2023
May 8 14:00:02 localhost kernel: [11569.526015] Call trace:
May 8 14:00:02 localhost kernel: [11569.526025] dump_backtrace+0x0/0x1d0
May 8 14:00:02 localhost kernel: [11569.526028] show_stack+0x30/0x40
May 8 14:00:02 localhost kernel: [11569.526033] dump_stack+0xd8/0x138
May 8 14:00:02 localhost kernel: [11569.526082] os_dump_stack+0x18/0x20 [nvidia]
May 8 14:00:02 localhost kernel: [11569.526123] tlsEntryGet+0x130/0x138 [nvidia]
May 8 14:00:02 localhost kernel: [11569.526161] gpumgrGetSomeGpu+0x7c/0x90 [nvidia]
May 8 14:00:02 localhost kernel: [11569.526200] threadPriorityStateFree+0x234/0x2a0 [nvidia]
May 8 14:00:02 localhost kernel: [11569.526238] RmShutdownAdapter+0x168/0x268 [nvidia]
May 8 14:00:02 localhost kernel: [11569.526277] rm_shutdown_adapter+0x50/0x70 [nvidia]
May 8 14:00:02 localhost kernel: [11569.526315] nv_shutdown_adapter+0xb4/0x4b0 [nvidia]
May 8 14:00:02 localhost kernel: [11569.526353] nv_shutdown_adapter+0x2d8/0x4b0 [nvidia]
May 8 14:00:02 localhost kernel: [11569.526391] nvidia_dev_put+0x38/0xc78 [nvidia]
May 8 14:00:02 localhost kernel: [11569.526426] nvkms_close_gpu+0x60/0x98 [nvidia_modeset]
May 8 14:00:02 localhost kernel: [11569.526453] nvRmFreeDeviceEvo+0x8c/0x130 [nvidia_modeset]
May 8 14:00:02 localhost kernel: [11569.526477] nvkms_ioctl_common+0x180/0x1b0 [nvidia_modeset]
May 8 14:00:02 localhost kernel: [11569.526516] nvidia_frontend_unlocked_ioctl+0x5c/0x78 [nvidia]
May 8 14:00:02 localhost kernel: [11569.526521] __arm64_sys_ioctl+0xac/0xf0
May 8 14:00:02 localhost kernel: [11569.526524] el0_svc_common.constprop.0+0x80/0x1d0
May 8 14:00:02 localhost kernel: [11569.526526] do_el0_svc+0x38/0xb0
May 8 14:00:02 localhost kernel: [11569.526530] el0_svc+0x1c/0x30
May 8 14:00:02 localhost kernel: [11569.526531] el0_sync_handler+0xa8/0xb0
May 8 14:00:02 localhost kernel: [11569.526533] el0_sync+0x16c/0x180

These are the actual logs we caught. Do you think it is a soft shutdown?

It may be more efficient if I reply directly in Chinese…

[11575.119284] nvgpu: 17000000.ga10b ga10b_intr_log_pending_intrs:306 [ERR] Pending TOP[0]: 0x00000004, LEAF[4]: 0x11000000
Isn’t the above log an error?

No. The whole log you pasted just looks like a normal shutdown sequence.

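Basically, a software problem will not make the device shut down; at most it causes a reboot.
If you really saw the device power off, the cause is usually overheating or a power supply problem. In that case the system would not print any shutdown log at all.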

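I monitored the temperature with tegrastats and there was no overheating. If it were a power supply problem, the other devices would show it too, but so far only one unit has the issue: it shuts down a few hours after booting.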

The log that actually points to a problem should be this part:

May 8 13:40:37 localhost kernel: [10404.800925] cpufreq: cpu0,cur:2114000,set:1984000,set ndiv:155
May 8 13:40:43 localhost kernel: [10410.888433] cpufreq: cpu0,cur:1827000,set:1984000,set ndiv:155
May 8 13:40:47 localhost kernel: [10414.945767] cpufreq: cpu0,cur:2249000,set:1984000,set ndiv:155
May 8 13:40:50 localhost kernel: [10417.988603] cpufreq: cpu0,cur:2161000,set:1984000,set ndiv:155
May 8 13:40:52 localhost kernel: [10420.018515] cpufreq: cpu0,cur:2138000,set:1984000,set ndiv:155
May 8 13:41:02 localhost kernel: [10429.151348] cpufreq: cpu4,cur:1857000,set:1984000,set ndiv:155
May 8 13:41:03 localhost kernel: [10430.163225] cpufreq: cpu0,cur:1857000,set:1984000,set ndiv:155
May 8 13:41:14 localhost kernel: [10441.323221] cpufreq: cpu0,cur:1820000,set:1984000,set ndiv:155
May 8 13:41:18 localhost kernel: [10445.383440] cpufreq: cpu4,cur:1857000,set:1984000,set ndiv:155
May 8 13:41:22 localhost kernel: [10449.439329] cpufreq: cpu0,cur:1865000,set:1984000,set ndiv:155
May 8 13:41:26 localhost kernel: [10453.496399] cpufreq: cpu0,cur:2123000,set:1984000,set ndiv:155
May 8 13:41:31 localhost kernel: [10458.568494] cpufreq: cpu0,cur:1787000,set:1984000,set ndiv:155
May 8 13:41:33 localhost kernel: [10460.597618] cpufreq: cpu0,cur:2167000,set:1984000,set ndiv:155
May 8 13:41:43 localhost kernel: [10470.741674] cpufreq: cpu0,cur:2187000,set:1984000,set ndiv:155
May 8 13:41:47 localhost kernel: [10474.800711] cpufreq: cpu0,cur:2198000,set:1984000,set ndiv:155
May 8 13:41:49 localhost kernel: [10476.828659] cpufreq: cpu0,cur:1806000,set:1984000,set ndiv:155
May 8 13:41:50 localhost kernel: [10477.843811] cpufreq: cpu0,cur:1847000,set:1984000,set ndiv:155
May 8 13:41:52 localhost kernel: [10479.874251] cpufreq: cpu4,cur:1822000,set:1984000,set ndiv:155
May 8 13:41:58 localhost kernel: [10485.959920] cpufreq: cpu0,cur:2145000,set:1984000,set ndiv:155
May 8 13:42:06 localhost kernel: [10494.075505] cpufreq: cpu0,cur:2140000,set:1984000,set ndiv:155
May 8 13:42:32 localhost kernel: [10519.437679] cpufreq: cpu0,cur:1842000,set:1984000,set ndiv:155
May 8 13:43:13 localhost kernel: [10561.033115] cpufreq: cpu0,cur:1844000,set:1984000,set ndiv:155

The maximum CPU frequency on Orin NX 16GB is only 1984000, but your log shows readings above that.
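If it helps to check the rest of kern.log for the same pattern, here is a minimal sketch that flags every cpufreq sample whose `cur` value is above the 1984000 limit mentioned above. The regular expression is only inferred from the log lines pasted in this thread, and the names `MAX_CPU_KHZ`, `CPUFREQ_RE` and `find_overshoots` are made up for illustration; none of this is an NVIDIA tool.

```python
import re

# Limit quoted in this thread for Orin NX 16GB (assumed to be in kHz).
MAX_CPU_KHZ = 1984000

# Pattern inferred from the pasted lines, e.g.
#   [10404.800925] cpufreq: cpu0,cur:2114000,set:1984000,set ndiv:155
CPUFREQ_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+cpufreq:\s+cpu(\d+),cur:(\d+),set:(\d+)"
)


def find_overshoots(lines, max_khz=MAX_CPU_KHZ):
    """Return (timestamp, cpu, cur) for each sample whose 'cur' exceeds max_khz."""
    hits = []
    for line in lines:
        m = CPUFREQ_RE.search(line)
        if m is None:
            continue  # not a cpufreq line
        ts = float(m.group(1))
        cpu = int(m.group(2))
        cur = int(m.group(3))
        if cur > max_khz:
            hits.append((ts, cpu, cur))
    return hits


# Three lines copied from the log in this thread.
sample = [
    "May 8 13:40:37 localhost kernel: [10404.800925] cpufreq: cpu0,cur:2114000,set:1984000,set ndiv:155",
    "May 8 13:40:43 localhost kernel: [10410.888433] cpufreq: cpu0,cur:1827000,set:1984000,set ndiv:155",
    "May 8 13:41:02 localhost kernel: [10429.151348] cpufreq: cpu4,cur:1857000,set:1984000,set ndiv:155",
]

for ts, cpu, cur in find_overshoots(sample):
    print(f"[{ts}] cpu{cpu}: cur={cur} exceeds {MAX_CPU_KHZ}")
```

To scan the full file, pass an open file object instead of `sample`, for example `find_overshoots(open("kern.log", errors="replace"))`.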

Since the software version you are using is nearly the oldest release, all I can suggest is that you try upgrading the BSP.

Also, it looks like you are running your system directly in MAXN mode? Are you sure you are not hitting an OC issue?
