SciBufModule open fails with libnvrmgpu error

jmanning2 · December 18, 2023, 3:28pm

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.8.1
DRIVE OS 6.0.6
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other

Target Operating System
Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-300)
DRIVE AGX Orin Developer Kit (940-63710-0010-200)
DRIVE AGX Orin Developer Kit (940-63710-0010-100)
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure its number)
other

SDK Manager Version
1.9.3.10904
other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other

Hi, when creating an NvSciBufModule via NvSciBufModuleOpen(), we get a return code of NvSciError_ResourceError along with this error message: libnvrm_gpu.so: NvRmGpuLibOpen failed, error=14. The issue appears to persist until the device is rebooted.

In dmesg, we also see the following error messages, which may or may not be related:

[  326.555302] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:90   [ERR]  Reported err_id(0x22c) to Safety_Services
[  326.555323] nvgpu: 17000000.ga10b           acr_report_error_to_sdl:53   [ERR]  ACR register access failure
[  326.555327] nvgpu: 17000000.ga10b     nvgpu_acr_wait_for_completion:143  [ERR]  flcn-1: HS ucode boot failed, err 1b
[  326.555331] nvgpu: 17000000.ga10b     nvgpu_acr_wait_for_completion:145  [ERR]  flcn-1: Mailbox-1 : 0xabcd1234
[  326.555335] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:55   [INFO]  Polling is in progress
[  326.555839] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:90   [ERR]  Reported err_id(0x296) to Safety_Services
[  326.555842] nvgpu: 17000000.ga10b nvgpu_pmu_report_bar0_pri_err_status:41   [ERR]  PMU falcon bar0 timeout. status(0x0), error_type(0xc)
[  326.555844] nvgpu: 17000000.ga10b            ga10b_bootstrap_hs_acr:72   [ERR]  ACR bootstrap failed
[  326.555846] nvgpu: 17000000.ga10b        nvgpu_acr_bootstrap_hs_acr:85   [ERR]  ACR bootstrap failed
[  326.555847] nvgpu: 17000000.ga10b       nvgpu_acr_construct_execute:108  [ERR]  Bootstrap HS ACR failed
[  326.555849] nvgpu: 17000000.ga10b            nvgpu_finalize_poweron:1023 [ERR]  Failed initialization for: g->ops.acr.acr_construct_execute
[  326.615794] nvgpu: 17000000.ga10b                 gk20a_power_write:127  [ERR]  power_node_write failed at busy
[  326.616015] cdi-mgr sipl_devblk_0: cdi_mgr_wait_err: wait_event_interruptible failed
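
To check for this state programmatically, the nvgpu error lines can be filtered out of the dmesg output. The following is a minimal Python sketch, assuming the line layout shown in the excerpt above. The names parse_nvgpu_errors and acr_bootstrap_failed and the regular expression are illustrative only and are not part of any NVIDIA tooling.

```python
import re

# Assumed layout, taken from the dmesg excerpt above:
# [ <timestamp>] nvgpu: <device> <function>:<line> [<LEVEL>] <message>
NVGPU_LINE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvgpu:\s+(?P<dev>\S+)\s+"
    r"(?P<func>\w+):(?P<line>\d+)\s+\[(?P<level>\w+)\]\s+(?P<msg>.*)"
)


def parse_nvgpu_errors(dmesg_text):
    """Return (timestamp, function, message) for every nvgpu [ERR] line."""
    errors = []
    for raw in dmesg_text.splitlines():
        match = NVGPU_LINE.search(raw)
        if match and match.group("level") == "ERR":
            errors.append(
                (
                    float(match.group("ts")),
                    match.group("func"),
                    match.group("msg").strip(),
                )
            )
    return errors


def acr_bootstrap_failed(dmesg_text):
    """Hypothetical helper: True if any nvgpu error reports an ACR bootstrap failure."""
    return any(
        "ACR bootstrap failed" in msg
        for _, _, msg in parse_nvgpu_errors(dmesg_text)
    )
```

The input text could come from, for example, subprocess.run(["dmesg"], capture_output=True, text=True).stdout, which may require root privileges depending on the system configuration.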

We have seen this libnvrm_gpu.so: NvRmGpuLibOpen failed, error=14 issue on both DRIVE OS 6.0.6 and DRIVE OS 6.0.8.1. Can NVIDIA provide any insight into what may be causing it, and are there any resolutions other than rebooting the device?

Thank you.

VickNV · December 18, 2023, 4:50pm

Please check whether reflashing resolves this issue. If it still occurs, please run any sample application that calls NvSciBufModuleOpen() to see whether it encounters the same issue.

jmanning2 · December 18, 2023, 5:08pm

We have tried reflashing, but the error continues to appear. The next time we see this issue I will try running sample applications, but I expect we’ll see the same error from libnvrm_gpu.so and the same NvSciError_ResourceError status code. Is there anything else we can do?

VickNV · December 18, 2023, 6:11pm

Could you confirm whether this error occurred after issuing a ‘reboot’ command, such as ‘sudo reboot now’?

jmanning2 · December 18, 2023, 6:20pm

Likely not. We generally reboot our Orins via /sys/class/tegra_hv_pm_ctl/tegra_hv_pm_ctl/device/trigger_sys_reboot, not the reboot command.
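
For reference, that reboot path can be wrapped in a small script. This is a minimal Python sketch only: writing the value "1" to the node is an assumption (the value is not stated in this thread), and the function name trigger_sys_reboot is illustrative. Writing to the real node requires root.

```python
from pathlib import Path

# Sysfs node mentioned above.
TRIGGER_NODE = Path(
    "/sys/class/tegra_hv_pm_ctl/tegra_hv_pm_ctl/device/trigger_sys_reboot"
)


def trigger_sys_reboot(node=TRIGGER_NODE, value="1"):
    """Write `value` to the reboot trigger node.

    Assumption: the node accepts "1" to request a reboot.
    Raises FileNotFoundError if the node does not exist
    (for example, when not running on a DRIVE OS target).
    """
    node = Path(node)
    if not node.exists():
        raise FileNotFoundError(f"reboot trigger node not found: {node}")
    node.write_text(value)
```

The node path is a parameter so the function can be exercised against a scratch file before being pointed at the real sysfs entry.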

VickNV · December 18, 2023, 6:36pm

That should be the correct way. In our previous experience, this error was caused by the ‘reboot’ command. Please keep an eye on this the next time the issue occurs.

jmanning2 · January 5, 2024, 7:41pm

We have reboot aliased to call trigger_sys_reboot, so we never call the reboot command directly. Are there other cases that could trigger this error state? For example, could cutting power directly put the device into this state? Would we then need to run trigger_sys_reboot to “clean up” the state? We have seen trigger_sys_reboot fail, in which case we cut power.

VickNV · January 5, 2024, 11:52pm

Can you please check if you observe the same issue when running any sample applications that call NvSciBufModuleOpen()? Additionally, it would be helpful to test if directly cutting power leads to this state.

jmanning2 · January 8, 2024, 4:50pm

Yes, this issue also occurs with sample applications, such as nvsipl_camera.

VickNV · January 8, 2024, 5:49pm

Could you clarify whether cutting power or calling trigger_sys_reboot leads to this error state? Please provide the steps to reproduce the issue and the complete nvsipl_camera command. Thanks.

system · February 13, 2024, 5:37am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.