NVRM gpumgrGetSomeGpu: Failed to retrieve pGpu when reboot on Orin NX module

WakkeWang · March 29, 2023, 11:31am

I am using a Jetson Orin NX 16GB module (no eMMC) on a custom carrier board with Jetson Linux R35.3.1. When I reboot, I see the following three dumps:

Orin_nx_reboot.txt (198.7 KB)

dump1:

[  232.556002] NVRM gpumgrGetSomeGpu: Failed to retrieve pGpu - Too early call!.
[  232.563384] NVRM nvAssertFailedNoLog: Assertion failed: NV_FALSE @ gpu_mgr.c:296
[  232.571014] CPU: 3 PID: 1206 Comm: Xorg Tainted: G           OE     5.10.104-tegra #1
[  232.579061] Hardware name: NVIDIA Orin NX Developer Kit (DT)
[  232.584878] Call trace:
[  232.587398]  dump_backtrace+0x0/0x1d0
[  232.591152]  show_stack+0x30/0x40
[  232.594557]  dump_stack+0xd8/0x138
[  232.598115]  os_dump_stack+0x18/0x20 [nvidia]
[  232.602648]  tlsEntryGet+0x130/0x138 [nvidia]
[  232.607175]  gpumgrGetSomeGpu+0x7c/0x90 [nvidia]
[  232.611972]  threadPriorityStateFree+0x234/0x2a0 [nvidia]
[  232.617567]  RmShutdownAdapter+0x168/0x268 [nvidia]
[  232.622636]  rm_shutdown_adapter+0x50/0x70 [nvidia]
[  232.627703]  nv_shutdown_adapter+0xb4/0x4b0 [nvidia]
[  232.632855]  nv_shutdown_adapter+0x2d8/0x4b0 [nvidia]
[  232.638098]  nvidia_dev_put+0x38/0xc40 [nvidia]
[  232.642807]  nvkms_close_gpu+0x60/0x98 [nvidia_modeset]
[  232.648208]  nvRmFreeDeviceEvo+0x8c/0x130 [nvidia_modeset]
[  232.653874]  nvkms_ioctl_common+0x180/0x1b0 [nvidia_modeset]
[  232.659738]  nvidia_frontend_unlocked_ioctl+0x5c/0x78 [nvidia]
[  232.665732]  __arm64_sys_ioctl+0xac/0xf0
[  232.669766]  el0_svc_common.constprop.0+0x80/0x1d0
[  232.674690]  do_el0_svc+0x38/0xb0
[  232.678090]  el0_svc+0x1c/0x30
[  232.681220]  el0_sync_handler+0xa8/0xb0
[  232.685162]  el0_sync+0x16c/0x180

dump2:

[  234.717595] tegra-xusb 3610000.xhci: xHCI host controller not responding, assume dead
[  234.717787] CPU:0, Error: cbb-fabric@0x13a00000, irq=25
[  234.725647] tegra-xusb 3610000.xhci: HC died; cleaning up
[  234.729103] nvgpu: 17000000.ga10b      ga10b_intr_log_pending_intrs:306  [ERR]  Pending TOP[0]: 0x00000004, LEAF[4]: 0x11000000
[  234.731019] **************************************
[  234.753311] CPU:0, Error:cbb-fabric, Errmon:2
[  234.757789] 	  Error Code		: PWRDOWN_ERR
[  234.761821] 
[  234.763353] 	  Error Code		: PWRDOWN_ERR
[  234.767378] 	  MASTER_ID		: CCPLEX
[  234.770868] 	  Address		: 0x3610420
[  234.774460] 	  Cache			: 0x1 -- Bufferable 
[  234.778760] 	  Protection		: 0x2 -- Unprivileged, Non-Secure, Data Access
[  234.785731] 	  Access_Type		: Read
[  234.789223] 	  Access_ID		: 0x11
[  234.789227] 	  Fabric		: cbb-fabric
[  234.796131] 	  Slave_Id		: 0x22
[  234.797589] hub 1-0:1.0: activate --> -19
[  234.799358] 	  Burst_length		: 0x0
[  234.799362] 	  Burst_type		: 0x1
[  234.810281] 	  Beat_size		: 0x2
[  234.813507] 	  VQC			: 0x0
[  234.816287] 	  GRPSEC		: 0x7e
[  234.819337] 	  FALCONSEC		: 0x0
[  234.822564] 	**************************************

dump3:

[  234.827600] ------------[ cut here ]------------
[  234.832361] WARNING: CPU: 0 PID: 0 at drivers/soc/tegra/cbb/tegra234-cbb.c:577 tegra234_cbb_isr+0x130/0x170
[  234.842395] Modules linked in: nvidia_modeset(O) fuse sysdrv(OE) ccm3310s(E) lzo_rle lzo_compress zram ramoops reed_solomon loop hid_logitech_hidpp input_leds snd_soc_tegra186_dspk snd_soc_tegra210_ope snd_soc_tegra186_asrc snd_soc_tegra186_arad snd_soc_tegra210_iqc snd_soc_tegra210_mvc snd_soc_tegra210_afc snd_soc_tegra210_dmic snd_soc_tegra210_adx snd_soc_tegra210_amx snd_soc_tegra210_mixer snd_soc_tegra210_admaif snd_soc_tegra210_i2s snd_soc_tegra210_sfc snd_soc_tegra_pcm aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce sha2_ce sha256_arm64 hid_logitech_dj sha1_ce snd_soc_tegra210_adsp snd_soc_spdif_tx snd_soc_tegra_machine_driver userspace_alert snd_hda_codec_hdmi r8168 snd_soc_tegra_utils snd_soc_simple_card_utils tegra_bpmp_thermal snd_soc_tegra210_ahub nvadsp tegra210_adma snd_hda_tegra snd_hda_codec r8169 snd_hda_core nv_imx219 realtek spi_tegra114 nvidia(O) binfmt_misc ina3221 pwm_fan nvgpu nvmap ip_tables x_tables [last unloaded: mtd]
[  234.930366] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           OE     5.10.104-tegra #1
[  234.938572] Hardware name: NVIDIA Orin NX Developer Kit (DT)
[  234.944373] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[  234.950533] pc : tegra234_cbb_isr+0x130/0x170
[  234.955007] lr : tegra234_cbb_isr+0x10c/0x170
[  234.959474] sp : ffff800010003e10
[  234.962873] x29: ffff800010003e10 x28: ffffdc1d4f2a2680 
[  234.968330] x27: 0000000000000001 x26: 0000000000000080 
[  234.973778] x25: ffffdc1d4ecb9ed0 x24: ffffdc1d4f60be40 
[  234.979233] x23: ffffdc1d4efa7000 x22: 0000000000000019 
[  234.984695] x21: ffffdc1d4f42ef20 x20: 0000000000000002 
[  234.990148] x19: ffffdc1d4f42ef10 x18: 0000000000000010 
[  234.995604] x17: 0000000000000000 x16: ffffdc1d4d5a3210 
[  235.001061] x15: ffffdc1d4f2a2bf0 x14: ffffffffffffffff 
[  235.006510] x13: ffff800090003917 x12: ffff80001000391f 
[  235.011967] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f 
[  235.017425] x9 : ffff800010003c30 x8 : 2a2a2a2a2a2a2a2a 
[  235.022872] x7 : 2a2a2a2a2a2a2a09 x6 : c0000000ffffefff 
[  235.028324] x5 : 0000000000057fa8 x4 : ffffdc1d4f2b7968 
[  235.033783] x3 : 0000000000000000 x2 : ffffdc1d4d73e170 
[  235.039231] x1 : ffffdc1d4f2a2680 x0 : 0000000100010001 
[  235.044684] Call trace:
[  235.047192]  tegra234_cbb_isr+0x130/0x170
[  235.051314]  __handle_irq_event_percpu+0x68/0x2a0
[  235.056141]  handle_irq_event_percpu+0x40/0xa0
[  235.060700]  handle_irq_event+0x50/0xf0
[  235.064637]  handle_fasteoi_irq+0xc0/0x170
[  235.068846]  generic_handle_irq+0x40/0x60
[  235.072960]  __handle_domain_irq+0x70/0xd0
[  235.077164]  gic_handle_irq+0x68/0x134
[  235.081008]  el1_irq+0xd0/0x...

Please help me solve this. Thanks!

WayneWWW · March 30, 2023, 1:27am

Can this issue be reproduced on the NVIDIA devkit?

WakkeWang · March 30, 2023, 1:49am

I only have a Xavier NX devkit carrier board (P3509), and the same issue occurs on it.

Here is the reboot log:
p3509+p3667_reboot.txt (186.2 KB)

WayneWWW · March 30, 2023, 1:53am

Hi,

Does it happen only on reboot, or in other situations too? You posted 3 dumps in the first post, but there is only one in the devkit case.

How do you reproduce the other 2 dumps? Also by running sudo reboot? Does it affect your use case or system?

WakkeWang · March 30, 2023, 2:03am

All 3 dumps happen when I reboot the system. The first dump happens on every reboot, and the other two happen occasionally.
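To check how often each dump actually appears, a serial-console capture of each reboot can be scanned for one signature line per dump. The following is a minimal Python sketch, not an NVIDIA tool: the function name `count_dumps` and the choice of signature strings are assumptions, picked from lines that appear once per dump in the first post.

```python
import re

# One signature per dump, taken from the logs in the first post.
# Each pattern is meant to match a line that appears once per occurrence.
SIGNATURES = {
    "dump1": re.compile(r"NVRM gpumgrGetSomeGpu: Failed to retrieve pGpu"),
    "dump2": re.compile(r"Error: cbb-fabric@"),
    "dump3": re.compile(r"WARNING: .*tegra234_cbb_isr"),
}


def count_dumps(log_text):
    """Return how many times each dump signature occurs in a console log."""
    counts = {name: 0 for name in SIGNATURES}
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Small inline example using two lines copied from the first post.
sample = (
    "[  232.556002] NVRM gpumgrGetSomeGpu: Failed to retrieve pGpu - Too early call!.\n"
    "[  234.717787] CPU:0, Error: cbb-fabric@0x13a00000, irq=25\n"
)
print(count_dumps(sample))
```

Running this over one saved log per reboot gives a per-dump count that can back up statements such as "every reboot" or "occasionally".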

WakkeWang · March 30, 2023, 2:17am

The first dump may affect the GPU, but I don't know how to test the GPU.

The second dump may affect USB (tegra-xusb 3610000.xhci). I worry that the controller will not work after reboot, so I will test more to confirm this.

As for the third dump, I don't know what functionality it affects.

Please help me solve this. Thanks.
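One way to relate the second dump to the USB controller is to read the fields of the CBB error report and compare the faulting address with the xHCI device name. The sketch below is only an illustration under stated assumptions: the helper names `parse_cbb_fields` and `offset_from_xhci_base` are hypothetical, and the base address 0x3610000 is inferred from the device name `3610000.xhci` in the log, not taken from NVIDIA documentation.

```python
import re

# Matches report lines such as "[  234.770868]    Address     : 0x3610420".
# The key must be followed by whitespace and then a colon, which skips
# ordinary log lines like "CPU:0, Error: cbb-fabric@...".
FIELD_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+([A-Za-z_][A-Za-z_ ]*?)\s+:\s+(\S.*?)\s*$"
)

# Assumption: base address inferred from the device name "3610000.xhci".
XHCI_BASE = 0x3610000


def parse_cbb_fields(log_text):
    """Collect the 'Name : value' fields of a CBB error report into a dict."""
    fields = {}
    for line in log_text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def offset_from_xhci_base(fields):
    """Offset of the faulting address from the assumed xHCI base address."""
    return int(fields["Address"], 16) - XHCI_BASE


# Lines copied from dump 2 in the first post.
SAMPLE = "\n".join([
    "[  234.757789] \t  Error Code\t\t: PWRDOWN_ERR",
    "[  234.767378] \t  MASTER_ID\t\t: CCPLEX",
    "[  234.770868] \t  Address\t\t: 0x3610420",
    "[  234.785731] \t  Access_Type\t\t: Read",
])

fields = parse_cbb_fields(SAMPLE)
print(fields["MASTER_ID"], fields["Access_Type"], hex(offset_from_xhci_base(fields)))
```

In the posted log the report describes a read by the CPU complex (CCPLEX) at 0x3610420 with error code PWRDOWN_ERR, printed next to the "xHCI host controller not responding" message. Whether that address really belongs to the xHCI register block depends on the size of the region, which the log does not show.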

WakkeWang · March 31, 2023, 7:03am

Hi team, any ideas?

WayneWWW · March 31, 2023, 7:07am

It looks like we cannot reproduce this issue on the devkit. Does it happen 100% of the time when the device is rebooted?

WakkeWang · March 31, 2023, 7:21am

Yes, the first dump happens 100% of the time when the module is rebooted.

WakkeWang · April 3, 2023, 1:29am

Hi team, any updates?

WayneWWW · April 6, 2023, 3:09am

We are not able to reproduce this issue on either the Orin NX 16GB or the 8GB module. Please put your module on the devkit and test again.

Remove all peripherals and leave only the serial console cable connected.

WakkeWang · April 7, 2023, 7:03am

Hi Wayne, please reproduce the issue like this:

root@w:~# 
root@w:~# echo 7 4 1 7 > /proc/sys/kernel/printk
root@w:~# 
root@w:~# reboot 
[  113.799988] NVRM gpumgrGetSomeGpu: Failed to retrieve pGpu - Too early call!.
[  113.807360] NVRM nvAssertFailedNoLog: Assertion failed: NV_FALSE @ gpu_mgr.c:296
[  113.814999] CPU: 1 PID: 1135 Comm: Xorg Tainted: G           O      5.10.104-tegra #1
[  113.823055] Hardware name: Unknown NVIDIA Orin NX Developer Kit/NVIDIA Orin NX Developer Kit, BIOS 3.1-32827747 03/19/2023
[  113.834421] Call trace:
[  113.836945]  dump_backtrace+0x0/0x1d0
[  113.840705]  show_stack+0x30/0x40
[  113.844115]  dump_stack+0xd8/0x138
[  113.847682]  os_dump_stack+0x18/0x20 [nvidia]
[  113.852223]  tlsEntryGet+0x130/0x138 [nvidia]
[  113.856761]  gpumgrGetSomeGpu+0x7c/0x90 [nvidia]
[  113.861562]  threadPriorityStateFree+0x234/0x2a0 [nvidia]
[  113.867171]  RmShutdownAdapter+0x168/0x268 [nvidia]
[  113.872240]  rm_shutdown_adapter+0x50/0x70 [nvidia]
[  113.877311]  nv_shutdown_adapter+0xb4/0x4b0 [nvidia]
[  113.882472]  nv_shutdown_adapter+0x2d8/0x4b0 [nvidia]
[  113.887731]  nvidia_dev_put+0x38/0xc40 [nvidia]
[  113.892435]  nvkms_close_gpu+0x60/0x98 [nvidia_modeset]
[  113.897835]  nvRmFreeDeviceEvo+0x8c/0x130 [nvidia_modeset]
[  113.903504]  nvkms_ioctl_common+0x180/0x1b0 [nvidia_modeset]
[  113.909376]  nvidia_frontend_unlocked_ioctl+0x5c/0x78 [nvidia]
[  113.915377]  __arm64_sys_ioctl+0xac/0xf0
[  113.919412]  el0_svc_common.constprop.0+0x80/0x1d0
[  113.924334]  do_el0_svc+0x38/0xb0
[  113.927740]  el0_svc+0x1c/0x30
[  113.930876]  el0_sync_handler+0xa8/0xb0
[  113.934816]  el0_sync+0x16c/0x180
[  115.515347] tegra_wdt_t18x 2190000.watchdog: Watchdog(0): wdt timeout set to 600 sec
[  115.523700] watchdog: watchdog0: watchdog did not stop!
[  115.549157] systemd-shutdown[1]: Syncing filesystems and block devices.
[  115.556202] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[  115.568912] systemd-journald[268]: Received SIGTERM from PID 1 (systemd-shutdow).
[  115.600813] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[  115.612287] systemd-shutdown[1]: Hardware watchdog 'Tegra WDT', version 1
[  115.620880] systemd-shutdown[1]: Unmounting file systems.
[  115.627935] [1969]: Remounting '/' read-only in with options '(null)'.
[  115.671230] EXT4-fs (nvme0n1p1): re-mounted. Opts: (null)
[  115.687570] systemd-shutdown[1]: All filesystems unmounted.
[  115.693334] systemd-shutdown[1]: Deactivating swaps.
[  115.698541] systemd-shutdown[1]: All swaps deactivated.
[  115.703920] systemd-shutdown[1]: Detaching loop devices.
[  115.710626] systemd-shutdown[1]: All loop devices detached.
[  115.716368] systemd-shutdown[1]: Detaching DM devices.
[  115.721754] systemd-shutdown[1]: All DM devices detached.
[  115.727323] systemd-shutdown[1]: All filesystems, swaps, loop devices and DM devices detached.
[  115.742865] systemd-shutdown[1]: Syncing filesystems and block devices.
[  115.749757] systemd-shutdown[1]: Rebooting.
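For readers repeating these steps: the four numbers written to /proc/sys/kernel/printk are, in order, the console log level, the default message log level, the minimum console log level, and the default console log level. A message reaches the console only if its level is numerically lower than the console log level, so a first value of 7 shows everything except debug messages. The sketch below only illustrates that rule; the names `PrintkLevels`, `parse_printk`, and `reaches_console` are hypothetical.

```python
from typing import NamedTuple


class PrintkLevels(NamedTuple):
    """The four values of /proc/sys/kernel/printk, in file order."""
    console_loglevel: int
    default_message_loglevel: int
    minimum_console_loglevel: int
    default_console_loglevel: int


def parse_printk(text):
    """Parse a string such as '7 4 1 7' into named log-level fields."""
    values = [int(v) for v in text.split()]
    if len(values) != 4:
        raise ValueError("expected four integers, got %r" % text)
    return PrintkLevels(*values)


def reaches_console(message_level, levels):
    """True if a message of this level (0 = emergency, 7 = debug) is printed."""
    return message_level < levels.console_loglevel


levels = parse_printk("7 4 1 7")  # the value used in the post above
print(levels)
print(reaches_console(6, levels), reaches_console(7, levels))
```

On a running system the current setting can be read back with `open("/proc/sys/kernel/printk").read()` and passed to `parse_printk`.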

WakkeWang · April 10, 2023, 2:08am

Hi team, any updates?

WakkeWang · April 11, 2023, 3:29am

Hi @WayneWWW, any updates?

WakkeWang · April 14, 2023, 3:16am

Hi team, could you please help me find the cause of the issue?

WayneWWW · April 14, 2023, 6:23am

If this does not cause any fatal error on your board, please just ignore it.

WakkeWang · April 14, 2023, 9:06am

Thanks for your reply. Have you already reproduced the issue on Orin NX?

WayneWWW · April 14, 2023, 9:18am

I can see the log. But since this is not fatal, we will look into it at a lower priority.

WakkeWang · April 14, 2023, 9:27am

Got it. Please share the patch once this has been fixed. Thanks for your help.

joe-sandom · December 6, 2023, 12:43pm

Hello, I’m also having the same issue. Has there been any update on this issue?
