System occasionally hangs due to NVIDIA driver

Software Version
[*] DRIVE OS 6.0.4 SDK

Target Operating System
[*] Linux

Hardware Platform
[*] DRIVE AGX Orin Developer Kit (940-63710-0010-D00)

SDK Manager Version
[*] 1.9.1.10844

Host Machine Version
[*] native Ubuntu Linux 20.04 Host installed with SDK Manager

The NVIDIA driver prints the log below, and the system sometimes hangs. When it hangs I cannot ssh in, but pinging the IP still works.

Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.023688] CPU: 0 PID: 1332 Comm: Xorg Tainted: G O 5.10.104-rt63-tegra #1
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.023691] Hardware name: p3710-0010 (DT)
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.023693] Call trace:
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.023694] dump_backtrace+0x0/0x1d0
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.023703] show_stack+0x2c/0x40
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.023706] dump_stack+0xd8/0x138
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.023711] os_dump_stack+0x14/0x1c [nvidia]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.023796] tlsEntryGet+0x130/0x138 [nvidia]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.023871] ctxdmaConstruct_IMPL+0x360/0x3a8 [nvidia]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.023946] __nvoc_ctor_ContextDma+0x74/0xa8 [nvidia]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024020] __nvoc_objCreate_ContextDma+0x78/0x110 [nvidia]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024094] __nvoc_objCreateDynamic+0x50/0x70 [nvidia]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024167] resservResourceFactory+0x74/0x100 [nvidia]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024239] indexRemove+0x294/0x4f0 [nvidia]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024313] serverAllocResourceUnderLock+0x21c/0x6b0 [nvidia]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024385] serverAllocResource+0x240/0x2e0 [nvidia]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024459] rmapiAllocWithSecInfo+0x180/0x2e0 [nvidia]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024534] rmapiAllocWithSecInfoTls+0x74/0xa8 [nvidia]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024608] rmapiControlWithSecInfoTls+0x4a8/0x520 [nvidia]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024682] nvkms_call_rm+0x5c/0x90 [nvidia_modeset]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024741] nvRmApiAlloc+0x30/0x40 [nvidia_modeset]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024789] nvkms_ioctl_common+0x174/0x1a0 [nvidia_modeset]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024836] nvidia_frontend_unlocked_ioctl+0x58/0x78 [nvidia]
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024910] __arm64_sys_ioctl+0xa8/0xf0
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024915] el0_svc_common.constprop.0+0x7c/0x1c0
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024919] do_el0_svc+0x34/0xa0
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024921] el0_svc+0x1c/0x30
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024925] el0_sync_handler+0xa8/0xb0
Feb 7 18:04:42 tegra-ubuntu kernel: [ 9.024928] el0_sync+0x16c/0x180

Dear @ming.xu4,
If possible, can you upgrade to DRIVE OS 6.0.5 and test the issue? Could you also elaborate on or share the repro steps that cause this issue to trigger?

Hello @SivaRamaKrishnaNV
After rebooting the system, you can find the backtrace log in /var/log/kern.log or /var/log/syslog.
However, the system hang itself is random.
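
Since the backtraces end up in /var/log/kern.log, a small script can pull out each call trace and keep only the frames that belong to the NVIDIA modules, which makes it easier to compare traces across boots. This is a minimal sketch, assuming the syslog line layout shown in this thread; `extract_traces` and `nvidia_frames` are hypothetical helper names, not part of any NVIDIA tool.

```python
import re

# Captures the message after the "[ seconds.micros]" kernel timestamp, e.g.
# "... kernel: [ 9.023711] os_dump_stack+0x14/0x1c [nvidia]".
KERNEL_LINE = re.compile(r"kernel: \[\s*(\d+\.\d+)\]\s+(.*)$")
# One stack frame: symbol+offset/size, optionally followed by "[module]".
FRAME = re.compile(r"^(\S+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?$")


def extract_traces(lines):
    """Return a list of call traces; each trace is a list of (symbol, module)."""
    traces, current = [], None
    for line in lines:
        m = KERNEL_LINE.search(line)
        if not m:
            continue
        msg = m.group(2).strip()
        if msg == "Call trace:":
            current = []
            traces.append(current)
            continue
        frame = FRAME.match(msg)
        if frame and current is not None:
            current.append((frame.group(1), frame.group(2)))
        else:
            current = None  # any non-frame line ends the current trace
    return traces


def nvidia_frames(trace):
    """Keep only frames from modules whose name starts with 'nvidia'."""
    return [(sym, mod) for sym, mod in trace if mod and mod.startswith("nvidia")]
```

To use it on a real log, pass the lines of the file, for example `extract_traces(open("/var/log/kern.log", errors="replace"))`, and print `nvidia_frames(t)` for each returned trace.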

Update:
It looks like we have hit the same issue (link below): sometimes the desktop freezes and I cannot ssh in, but ping still works.

Dear @ming.xu4,
I could access my machine flashed with DRIVE OS 6.0.5 without any issue.
Could you try flashing DRIVE OS 6.0.5 and verify?

Same issue on 6.0.5.

nvidia@tegra-ubuntu:~$ cat /etc/nvidia/version-ubuntu-rootfs.txt
6.0.5.0-31732390

kernel log:
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949130] NVRM nvAssertFailedNoLog: Assertion failed: thisAddress < pMapping->gpuNvLength @ os.c:1612
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949140] CPU: 1 PID: 1830 Comm: Xorg Tainted: G O 5.10.120-rt70-tegra #1
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949143] Hardware name: p3710-0010 (DT)
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949145] Call trace:
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949146] dump_backtrace+0x0/0x1d0
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949155] show_stack+0x30/0x50
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949159] dump_stack+0xd8/0x140
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949164] os_dump_stack+0x18/0x20 [nvidia]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949252] tlsEntryGet+0x130/0x138 [nvidia]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949329] osDevReadReg032+0x4c/0x70 [nvidia]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949405] ioaprtReadReg32_IMPL+0x140/0x1c8 [nvidia]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949481] gpuFuseSupportsDisplay_T234D+0x28/0x38 [nvidia]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949557] kdispStatePreInitLocked_IMPL+0x28/0xd8 [nvidia]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949633] gpuStatePreInit_IMPL+0x1f8/0x680 [nvidia]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949709] gpumgrStatePreInitGpu+0x74/0xa8 [nvidia]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949784] RmInitAdapter+0x580/0xe28 [nvidia]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949860] rm_init_adapter+0xa8/0xb8 [nvidia]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.949936] nvidia_isr_kthread_bh+0x51c/0xd48 [nvidia]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.950011] nvidia_dev_get+0x3c/0x88 [nvidia]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.950086] nvkms_open_gpu+0x64/0xa8 [nvidia_modeset]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.950146] nvRmAllocDeviceEvo+0x654/0x848 [nvidia_modeset]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.950194] nvkms_ioctl_common+0x180/0x1b0 [nvidia_modeset]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.950242] nvidia_frontend_unlocked_ioctl+0x5c/0x78 [nvidia]
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.950320] __arm64_sys_ioctl+0xb4/0x110
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.950324] el0_svc_common.constprop.0+0x80/0x1f0
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.950328] do_el0_svc+0x38/0x90
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.950330] el0_svc+0x1c/0x30
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.950333] el0_sync_handler+0xb8/0xc0
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.950334] el0_sync+0x16c/0x180
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.951293] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x2080017e result 0x56:
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.951450] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x2080013f result 0x56:
Feb 9 06:04:22 tegra-ubuntu kernel: [ 21.952437] NVRM rmapiAllocWithSecInfo: allocation failed; status: Given class-id not valid [NV_ERR_INVALID_CLASS] (0
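
The `Failed RM ctrl call` lines above can be tallied per boot, so that boots which later hang can be compared with boots that do not. The following is a rough sketch that only extracts and counts the `cmd` and `result` values as they appear in the log; it does not attempt to interpret them, and `count_failed_ctrl_calls` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as:
# "NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x2080017e result 0x56:"
FAILED_CTRL = re.compile(
    r"NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call "
    r"cmd:(0x[0-9a-fA-F]+) result (0x[0-9a-fA-F]+)"
)


def count_failed_ctrl_calls(lines):
    """Count occurrences of each (cmd, result) pair, both parsed as integers."""
    counts = Counter()
    for line in lines:
        m = FAILED_CTRL.search(line)
        if m:
            counts[(int(m.group(1), 16), int(m.group(2), 16))] += 1
    return counts
```

Printing the pairs with `hex()` gives a compact per-boot summary that can be attached to a report alongside the full log.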

Dear @ming.xu4,
Do you see the above log message when the system freezes? How frequently do you see the system hang? Do you see any pattern that reproduces the issue?

Not frequently; I see it about once every 3-4 days.
It is hard to reproduce and I have not found any pattern: sometimes it happens while moving the mouse, sometimes while compiling code, sometimes while running TensorRT.
I only noticed that the screen froze; frankly speaking, I did not see any kernel crash log when the desktop froze.
Here is the suspected point.
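
Because the hang is random and leaves no crash log, one option is to run a probe from the host machine that records when the target switches from "ssh works" to "ping works but ssh does not", so the timestamp can be matched against kern.log after the reboot. This is a sketch under assumptions: it expects the Linux iputils `ping` on the host, the target address is supplied by the user, and all function names are hypothetical. It waits for the SSH banner rather than only a TCP connect, on the assumption that the kernel may still complete the TCP handshake while user space is frozen.

```python
import socket
import subprocess
import time
from datetime import datetime


def ping_ok(host, timeout_s=1):
    """True if one ICMP echo gets a reply (assumes Linux iputils `ping`)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def ssh_banner_ok(host, port=22, timeout_s=3.0):
    """True if the SSH server accepts a connection and sends its banner."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)
            return b"SSH-" in sock.recv(256)
    except OSError:
        return False


def classify(ping, ssh):
    """Map the two probe results to a coarse state label."""
    if ssh:
        return "ok"
    if ping:
        return "hang-suspected"  # matches the symptom: ping works, ssh does not
    return "unreachable"


def monitor(host, interval_s=10):
    """Print a timestamped line whenever the state changes. Runs until interrupted."""
    last = None
    while True:
        state = classify(ping_ok(host), ssh_banner_ok(host))
        if state != last:
            print(f"{datetime.now().isoformat(timespec='seconds')} {host} {state}")
            last = state
        time.sleep(interval_s)
```

Running `monitor("<target IP>")` on the host and leaving it for several days would give the approximate time of each freeze.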

Dear @ming.xu4,
It is difficult to analyze without repro steps. Do you still notice this freezing issue?

Dear @SivaRamaKrishnaNV
I am still observing this issue.
After disabling “X” and using console mode (not supported by NVIDIA, but ssh works), the system has worked well for the past 3 weeks and has never hung.
I will switch back to the GNOME desktop and update the results later.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.