NVRM gpumgrGetSomeGpu: Failed to retrieve pGpu when rebooting an Orin NX module

I am using a Jetson Orin NX 16GB module (no eMMC) on a custom carrier board with Jetson Linux R35.3.1. When I reboot, I get the following 3 dumps:

Orin_nx_reboot.txt (198.7 KB)

dump1:

[  232.556002] NVRM gpumgrGetSomeGpu: Failed to retrieve pGpu - Too early call!.
[  232.563384] NVRM nvAssertFailedNoLog: Assertion failed: NV_FALSE @ gpu_mgr.c:296
[  232.571014] CPU: 3 PID: 1206 Comm: Xorg Tainted: G           OE     5.10.104-tegra #1
[  232.579061] Hardware name: NVIDIA Orin NX Developer Kit (DT)
[  232.584878] Call trace:
[  232.587398]  dump_backtrace+0x0/0x1d0
[  232.591152]  show_stack+0x30/0x40
[  232.594557]  dump_stack+0xd8/0x138
[  232.598115]  os_dump_stack+0x18/0x20 [nvidia]
[  232.602648]  tlsEntryGet+0x130/0x138 [nvidia]
[  232.607175]  gpumgrGetSomeGpu+0x7c/0x90 [nvidia]
[  232.611972]  threadPriorityStateFree+0x234/0x2a0 [nvidia]
[  232.617567]  RmShutdownAdapter+0x168/0x268 [nvidia]
[  232.622636]  rm_shutdown_adapter+0x50/0x70 [nvidia]
[  232.627703]  nv_shutdown_adapter+0xb4/0x4b0 [nvidia]
[  232.632855]  nv_shutdown_adapter+0x2d8/0x4b0 [nvidia]
[  232.638098]  nvidia_dev_put+0x38/0xc40 [nvidia]
[  232.642807]  nvkms_close_gpu+0x60/0x98 [nvidia_modeset]
[  232.648208]  nvRmFreeDeviceEvo+0x8c/0x130 [nvidia_modeset]
[  232.653874]  nvkms_ioctl_common+0x180/0x1b0 [nvidia_modeset]
[  232.659738]  nvidia_frontend_unlocked_ioctl+0x5c/0x78 [nvidia]
[  232.665732]  __arm64_sys_ioctl+0xac/0xf0
[  232.669766]  el0_svc_common.constprop.0+0x80/0x1d0
[  232.674690]  do_el0_svc+0x38/0xb0
[  232.678090]  el0_svc+0x1c/0x30
[  232.681220]  el0_sync_handler+0xa8/0xb0
[  232.685162]  el0_sync+0x16c/0x180

dump2:

[  234.717595] tegra-xusb 3610000.xhci: xHCI host controller not responding, assume dead
[  234.717787] CPU:0, Error: cbb-fabric@0x13a00000, irq=25
[  234.725647] tegra-xusb 3610000.xhci: HC died; cleaning up
[  234.729103] nvgpu: 17000000.ga10b      ga10b_intr_log_pending_intrs:306  [ERR]  Pending TOP[0]: 0x00000004, LEAF[4]: 0x11000000
[  234.731019] **************************************
[  234.753311] CPU:0, Error:cbb-fabric, Errmon:2
[  234.757789] 	  Error Code		: PWRDOWN_ERR
[  234.761821] 
[  234.763353] 	  Error Code		: PWRDOWN_ERR
[  234.767378] 	  MASTER_ID		: CCPLEX
[  234.770868] 	  Address		: 0x3610420
[  234.774460] 	  Cache			: 0x1 -- Bufferable 
[  234.778760] 	  Protection		: 0x2 -- Unprivileged, Non-Secure, Data Access
[  234.785731] 	  Access_Type		: Read
[  234.789223] 	  Access_ID		: 0x11
[  234.789227] 	  Fabric		: cbb-fabric
[  234.796131] 	  Slave_Id		: 0x22
[  234.797589] hub 1-0:1.0: activate --> -19
[  234.799358] 	  Burst_length		: 0x0
[  234.799362] 	  Burst_type		: 0x1
[  234.810281] 	  Beat_size		: 0x2
[  234.813507] 	  VQC			: 0x0
[  234.816287] 	  GRPSEC		: 0x7e
[  234.819337] 	  FALCONSEC		: 0x0
[  234.822564] 	**************************************

dump3:

[  234.827600] ------------[ cut here ]------------
[  234.832361] WARNING: CPU: 0 PID: 0 at drivers/soc/tegra/cbb/tegra234-cbb.c:577 tegra234_cbb_isr+0x130/0x170
[  234.842395] Modules linked in: nvidia_modeset(O) fuse sysdrv(OE) ccm3310s(E) lzo_rle lzo_compress zram ramoops reed_solomon loop hid_logitech_hidpp input_leds snd_soc_tegra186_dspk snd_soc_tegra210_ope snd_soc_tegra186_asrc snd_soc_tegra186_arad snd_soc_tegra210_iqc snd_soc_tegra210_mvc snd_soc_tegra210_afc snd_soc_tegra210_dmic snd_soc_tegra210_adx snd_soc_tegra210_amx snd_soc_tegra210_mixer snd_soc_tegra210_admaif snd_soc_tegra210_i2s snd_soc_tegra210_sfc snd_soc_tegra_pcm aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce sha2_ce sha256_arm64 hid_logitech_dj sha1_ce snd_soc_tegra210_adsp snd_soc_spdif_tx snd_soc_tegra_machine_driver userspace_alert snd_hda_codec_hdmi r8168 snd_soc_tegra_utils snd_soc_simple_card_utils tegra_bpmp_thermal snd_soc_tegra210_ahub nvadsp tegra210_adma snd_hda_tegra snd_hda_codec r8169 snd_hda_core nv_imx219 realtek spi_tegra114 nvidia(O) binfmt_misc ina3221 pwm_fan nvgpu nvmap ip_tables x_tables [last unloaded: mtd]
[  234.930366] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           OE     5.10.104-tegra #1
[  234.938572] Hardware name: NVIDIA Orin NX Developer Kit (DT)
[  234.944373] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[  234.950533] pc : tegra234_cbb_isr+0x130/0x170
[  234.955007] lr : tegra234_cbb_isr+0x10c/0x170
[  234.959474] sp : ffff800010003e10
[  234.962873] x29: ffff800010003e10 x28: ffffdc1d4f2a2680 
[  234.968330] x27: 0000000000000001 x26: 0000000000000080 
[  234.973778] x25: ffffdc1d4ecb9ed0 x24: ffffdc1d4f60be40 
[  234.979233] x23: ffffdc1d4efa7000 x22: 0000000000000019 
[  234.984695] x21: ffffdc1d4f42ef20 x20: 0000000000000002 
[  234.990148] x19: ffffdc1d4f42ef10 x18: 0000000000000010 
[  234.995604] x17: 0000000000000000 x16: ffffdc1d4d5a3210 
[  235.001061] x15: ffffdc1d4f2a2bf0 x14: ffffffffffffffff 
[  235.006510] x13: ffff800090003917 x12: ffff80001000391f 
[  235.011967] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f 
[  235.017425] x9 : ffff800010003c30 x8 : 2a2a2a2a2a2a2a2a 
[  235.022872] x7 : 2a2a2a2a2a2a2a09 x6 : c0000000ffffefff 
[  235.028324] x5 : 0000000000057fa8 x4 : ffffdc1d4f2b7968 
[  235.033783] x3 : 0000000000000000 x2 : ffffdc1d4d73e170 
[  235.039231] x1 : ffffdc1d4f2a2680 x0 : 0000000100010001 
[  235.044684] Call trace:
[  235.047192]  tegra234_cbb_isr+0x130/0x170
[  235.051314]  __handle_irq_event_percpu+0x68/0x2a0
[  235.056141]  handle_irq_event_percpu+0x40/0xa0
[  235.060700]  handle_irq_event+0x50/0xf0
[  235.064637]  handle_fasteoi_irq+0xc0/0x170
[  235.068846]  generic_handle_irq+0x40/0x60
[  235.072960]  __handle_domain_irq+0x70/0xd0
[  235.077164]  gic_handle_irq+0x68/0x134
[  235.081008]  el1_irq+0xd0/0x...

Please help me solve this. Thanks!

Can this issue be reproduced on the NVIDIA devkit?

I only have a Xavier NX devkit carrier board (P3509), and the same issue occurs on it.

Here is the reboot log:
p3509+p3667_reboot.txt (186.2 KB)

Hi,

Does it only happen on reboot, or in other situations as well? There are 3 dumps in your first post, but only one in the devkit case.

How do you reproduce the other 2 dumps? Also by running sudo reboot? Does it affect your use case or system?

All 3 dumps happen when I reboot the system. The first dump happens on every reboot, and the other two happen occasionally.
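To put numbers on "every reboot" versus "occasionally", saved serial console logs from several reboots could be scanned for the three signatures. This is only a minimal sketch: the function names and signature labels are hypothetical, and the match strings are simply copied from the dumps in this thread.

```python
import re

# Match strings copied from the dumps in this thread; the labels on the
# left are made up for illustration.
SIGNATURES = {
    "nvrm_assert": re.compile(r"NVRM gpumgrGetSomeGpu: Failed to retrieve pGpu"),
    "xhci_died": re.compile(r"tegra-xusb \S+: HC died"),
    "cbb_warning": re.compile(r"WARNING: .*tegra234_cbb_isr"),
}


def count_signatures(log_text):
    """Return how many lines of one log match each signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def summarize(logs):
    """Given one log string per reboot, count the reboots that hit each signature."""
    hits = {name: 0 for name in SIGNATURES}
    for log_text in logs:
        for name, n in count_signatures(log_text).items():
            if n > 0:
                hits[name] += 1
    return hits
```

Feeding `summarize` one captured log per reboot gives, for each signature, the number of reboots in which it appeared at least once.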

The first dump may affect the GPU, but I don't know how to test the GPU.

The second dump may affect USB (tegra-xusb 3610000.xhci). I worry that the controller will not work after reboot; I will test more to confirm this.
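One way to relate the second dump to the xHCI controller is to compare the faulting `Address` field with the controller's device name. The sketch below assumes that the hex prefix of a name like `3610000.xhci` is the register base address; the function name is hypothetical, and the size of the register window is not known from the log, so a small offset only suggests that the read landed in the xHCI block.

```python
import re


def xhci_offset(device_name, address_line):
    """Return the faulting address minus the device base, or None.

    Assumption: the hex prefix of a device-tree style name such as
    "3610000.xhci" is the register base address of that device.
    """
    base = int(device_name.split(".", 1)[0], 16)
    match = re.search(r"Address\s*:\s*(0x[0-9a-fA-F]+)", address_line)
    if match is None:
        return None  # no Address field on this line
    address = int(match.group(1), 16)
    if address < base:
        return None  # below the base, so not in this device's window
    return address - base
```

For the `Address` line in dump2 (0x3610420) and the name `3610000.xhci`, this returns 0x420, which would be consistent with the CPU reading an xHCI register around the time the controller was reported dead.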

As for the third dump, I don't know what functionality it affects.

Please help me solve this. Thanks.

Hi team, any ideas?

Looks like we cannot reproduce this issue on the devkit. Does this happen 100% of the time when you reboot the device?

Yes, the first dump happens 100% of the time when rebooting the module.

Hi team, any updates?

We are not able to reproduce this issue on either the Orin NX 16GB or the 8GB module. Please put your module on the devkit and test again.

Remove all peripherals and leave only the serial console cable connected.

Hi Wayne, please reproduce the issue like this:

root@w:~# 
root@w:~# echo 7 4 1 7 > /proc/sys/kernel/printk
root@w:~# 
root@w:~# reboot 
[  113.799988] NVRM gpumgrGetSomeGpu: Failed to retrieve pGpu - Too early call!.
[  113.807360] NVRM nvAssertFailedNoLog: Assertion failed: NV_FALSE @ gpu_mgr.c:296
[  113.814999] CPU: 1 PID: 1135 Comm: Xorg Tainted: G           O      5.10.104-tegra #1
[  113.823055] Hardware name: Unknown NVIDIA Orin NX Developer Kit/NVIDIA Orin NX Developer Kit, BIOS 3.1-32827747 03/19/2023
[  113.834421] Call trace:
[  113.836945]  dump_backtrace+0x0/0x1d0
[  113.840705]  show_stack+0x30/0x40
[  113.844115]  dump_stack+0xd8/0x138
[  113.847682]  os_dump_stack+0x18/0x20 [nvidia]
[  113.852223]  tlsEntryGet+0x130/0x138 [nvidia]
[  113.856761]  gpumgrGetSomeGpu+0x7c/0x90 [nvidia]
[  113.861562]  threadPriorityStateFree+0x234/0x2a0 [nvidia]
[  113.867171]  RmShutdownAdapter+0x168/0x268 [nvidia]
[  113.872240]  rm_shutdown_adapter+0x50/0x70 [nvidia]
[  113.877311]  nv_shutdown_adapter+0xb4/0x4b0 [nvidia]
[  113.882472]  nv_shutdown_adapter+0x2d8/0x4b0 [nvidia]
[  113.887731]  nvidia_dev_put+0x38/0xc40 [nvidia]
[  113.892435]  nvkms_close_gpu+0x60/0x98 [nvidia_modeset]
[  113.897835]  nvRmFreeDeviceEvo+0x8c/0x130 [nvidia_modeset]
[  113.903504]  nvkms_ioctl_common+0x180/0x1b0 [nvidia_modeset]
[  113.909376]  nvidia_frontend_unlocked_ioctl+0x5c/0x78 [nvidia]
[  113.915377]  __arm64_sys_ioctl+0xac/0xf0
[  113.919412]  el0_svc_common.constprop.0+0x80/0x1d0
[  113.924334]  do_el0_svc+0x38/0xb0
[  113.927740]  el0_svc+0x1c/0x30
[  113.930876]  el0_sync_handler+0xa8/0xb0
[  113.934816]  el0_sync+0x16c/0x180
[  115.515347] tegra_wdt_t18x 2190000.watchdog: Watchdog(0): wdt timeout set to 600 sec
[  115.523700] watchdog: watchdog0: watchdog did not stop!
[  115.549157] systemd-shutdown[1]: Syncing filesystems and block devices.
[  115.556202] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[  115.568912] systemd-journald[268]: Received SIGTERM from PID 1 (systemd-shutdow).
[  115.600813] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[  115.612287] systemd-shutdown[1]: Hardware watchdog 'Tegra WDT', version 1
[  115.620880] systemd-shutdown[1]: Unmounting file systems.
[  115.627935] [1969]: Remounting '/' read-only in with options '(null)'.
[  115.671230] EXT4-fs (nvme0n1p1): re-mounted. Opts: (null)
[  115.687570] systemd-shutdown[1]: All filesystems unmounted.
[  115.693334] systemd-shutdown[1]: Deactivating swaps.
[  115.698541] systemd-shutdown[1]: All swaps deactivated.
[  115.703920] systemd-shutdown[1]: Detaching loop devices.
[  115.710626] systemd-shutdown[1]: All loop devices detached.
[  115.716368] systemd-shutdown[1]: Detaching DM devices.
[  115.721754] systemd-shutdown[1]: All DM devices detached.
[  115.727323] systemd-shutdown[1]: All filesystems, swaps, loop devices and DM devices detached.
[  115.742865] systemd-shutdown[1]: Syncing filesystems and block devices.
[  115.749757] systemd-shutdown[1]: Rebooting.
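The `echo 7 4 1 7 > /proc/sys/kernel/printk` step matters because the first of the four values is the console log level: only messages more severe than that level reach the console, so a lower setting can hide the dump from the serial output. A small sketch for reading the four values back by name; the function name is hypothetical, and the field names follow the kernel's sysctl documentation.

```python
# Field order of /proc/sys/kernel/printk as described in the kernel's
# sysctl documentation.
PRINTK_FIELDS = (
    "console_loglevel",
    "default_message_loglevel",
    "minimum_console_loglevel",
    "default_console_loglevel",
)


def parse_printk(text):
    """Map the four whitespace-separated printk values to their names."""
    values = [int(v) for v in text.split()]
    if len(values) != len(PRINTK_FIELDS):
        raise ValueError("expected 4 values, got %d" % len(values))
    return dict(zip(PRINTK_FIELDS, values))
```

On a running system the input would come from reading `/proc/sys/kernel/printk`; with the values used above, `console_loglevel` is 7, so messages up to info level are printed on the console.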

Hi team, any updates?

Hi @WayneWWW, any updates?

Hi team, could you please help me find the cause of the issue?

If this does not cause any fatal error on your board, please just ignore it.

Thanks for your reply. Have you already reproduced the issue on Orin NX?

I can see the log. But as this is not fatal, we will look into it at lower priority.

Got it. Please share the patch once it has been fixed. Thanks for your help.

Hello, I’m also having the same issue. Has there been any update on this issue?