Orin NX 16GB, JetPack 5.1.2: programs repeatedly fail to use the GPU after the board powers on

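Hello NVIDIA team,
One of our delivered projects uses six Orin NX 16GB modules on a custom carrier board, running JetPack 5.1.2. We recently found that on one of these boards, programs frequently cannot use the GPU after power-on boot; this happens on more than 50% of boots. The same program uses the GPU normally on the other five boards. Attached is the boot log from the affected board, in which we found the following entries: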

Mar 11 19:40:47 node-101-13 nvpmodel.sh[1198]: NVPM ERROR: Error opening /sys/devices/gpu.0/tpc_pg_mask: 2
Mar 11 19:40:47 node-101-13 nvpmodel.sh[1198]: NVPM ERROR: optMask is 2, no request for power mode
Mar 11 19:40:47 node-101-13 systemd[1]: nvpmodel.service: Main process exited, code=exited, status=234/n/a
Mar 11 19:40:47 node-101-13 systemd[1]: nvpmodel.service: Failed with result 'exit-code'.
Mar 11 19:40:47 node-101-13 systemd[1]: Failed to start nvpmodel service.


Mar 11 19:41:43 node-101-13 /usr/lib/gdm3/gdm-x-session[11419]: (II) NVIDIA GLX Module  35.4.1  Release Build  (bugfix_main)  (buildbrain@mobile-u64-6422-d7000)  Tue Aug  1 12:38:45 PDT 2023
Mar 11 19:41:43 node-101-13 /usr/lib/gdm3/gdm-x-session[11419]: (II) NVIDIA: The X server supports PRIME Render Offload.
Mar 11 19:41:43 node-101-13 /usr/lib/gdm3/gdm-x-session[11419]: (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
Mar 11 19:41:43 node-101-13 /usr/lib/gdm3/gdm-x-session[11419]: (EE) NVIDIA(0): Failing initialization of X screen



Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.config - Load config from /usr/local/jtop/config.json
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.service - jetson_stats 4.2.6 - server loaded
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.service - Running on Python: 3.8.10
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.hardware - Hardware detected aarch64
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.hardware - NVIDIA Jetson 699-level Part Number=699-13767-0000-301 G.1
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.hardware - NVIDIA Jetson Module=NVIDIA Jetson Orin NX (16GB ram)
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.hardware - NVIDIA Jetson detected L4T=35.4.1
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.cpu - Found 8 CPU
Mar 12 08:15:55 node-101-13 jtop[22261]: [WARNING] jtop.core.gpu - No NVIDIA GPU available
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.processes - Process service started
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.memory - Found EMC!
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.memory - Memory service started
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.engine - Special Engine group found: [dlaX]
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.engine - Special Engine group found: [pvaX]
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.engine - Engines found: [APE DLA0 DLA1 NVDEC NVENC NVJPG OFA PVA0 SE VIC]
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "CV0" in thermal_zone2
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "CPU" in thermal_zone0
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "SOC2" in thermal_zone7
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "SOC0" in thermal_zone5
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "CV1" in thermal_zone3
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "GPU" in thermal_zone1
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "tj" in thermal_zone8
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "SOC1" in thermal_zone6
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "CV2" in thermal_zone4
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.power - Alarms VDD_IN - {'crit_alarm': 0, 'max_alarm': 0}
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.power - Alarms VDD_CPU_GPU_CV - {'crit_alarm': 0, 'max_alarm': 0}
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.power - Alarms VDD_SOC - {'crit_alarm': 0, 'max_alarm': 0}

We are not sure whether the GPU failed to load during system boot. Could you please help analyze this?
0312-syslog.txt (229.8 KB)

Please share the lsmod output after boot.

OK, I will contact the site and post the output once I have it. Besides the kernel module list, do you need anything else?

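Basically, what the current log shows is that no messages from the "nvgpu" driver appear.

If nvgpu is also missing from lsmod, please insmod it manually and see whether you hit any errors.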
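For repeated checks across reboots, the lsmod check can be scripted. The following is a minimal sketch in Python, assuming only that `/proc/modules` (the file lsmod itself reads) is available; the helper names `loaded_modules` and `is_module_loaded` are hypothetical and not part of any NVIDIA tool.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text.

    Each non-empty line starts with the module name, followed by size,
    reference count, dependencies, state, and load address.
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def is_module_loaded(name, proc_modules_text):
    """True if `name` appears as a loaded module in the given text."""
    return name in loaded_modules(proc_modules_text)


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        text = proc_modules.read_text()
        print("nvgpu loaded:", is_module_loaded("nvgpu", text))
```

Running this right after boot on the affected board gives the same answer as looking for nvgpu in the lsmod output.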

Hello! Here is the information we collected:
1. After power-on boot, the following messages appear in the log:

Mar 31 20:08:59 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) Module glxserver_nvidia: vendor="NVIDIA Corporation"
Mar 31 20:08:59 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: #011compiled for 1.6.99.901, module version = 1.0.0
Mar 31 20:08:59 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: #011Module class: X.Org Server Extension
Mar 31 20:08:59 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) NVIDIA GLX Module  35.4.1  Release Build  (bugfix_main)  (buildbrain@mobile-u64-6422-d7000)  Tue Aug  1 12:38:45 PDT 2023
Mar 31 20:09:00 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) NVIDIA: The X server supports PRIME Render Offload.
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE) NVIDIA(0): Failing initialization of X screen
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) UnloadModule: "nvidia"
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) UnloadSubModule: "glxserver_nvidia"
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) Unloading glxserver_nvidia
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) UnloadSubModule: "wfb"
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) UnloadSubModule: "fb"
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE) Screen(s) found, but none have a usable configuration.
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE)
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: Fatal server error:
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE) no screens found(EE)
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE)
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: Please consult the The X.Org Foundation support
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: #011 at http://wiki.x.org
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]:  for help.
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE)
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE) Server terminated with error (1). Closing log file.



Mar 31 20:11:44 node-101-13 jtop[29556]: [INFO] jtop.core.config - Load config from /usr/local/jtop/config.json
Mar 31 20:11:44 node-101-13 jtop[29556]: [INFO] jtop.service - jetson_stats 4.2.6 - server loaded
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.service - Running on Python: 3.8.10
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.hardware - Hardware detected aarch64
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.hardware - NVIDIA Jetson 699-level Part Number=699-13767-0000-301 G.1
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.hardware - NVIDIA Jetson Module=NVIDIA Jetson Orin NX (16GB ram)
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.hardware - NVIDIA Jetson detected L4T=35.4.1
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.cpu - Found 8 CPU
Mar 31 20:11:45 node-101-13 jtop[29556]: [WARNING] jtop.core.gpu - No NVIDIA GPU available
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.processes - Process service started
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.memory - Found EMC!
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.memory - Memory service started
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.engine - Special Engine group found: [dlaX]
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.engine - Special Engine group found: [pvaX]
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.engine - Engines found: [APE DLA0 DLA1 NVDEC NVENC NVJPG OFA PVA0 SE VIC]

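2. lsmod shows that nvgpu is not among the loaded modules.
3. Manually loading nvgpu with insmod works, with no error messages.
4. After manually loading nvgpu, our program can use the GPU normally.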
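The manual check-and-load sequence in steps 2 to 4 could be automated as a stopgap. The sketch below is an assumption-laden illustration, not a fix for the root cause: it uses `modprobe` instead of `insmod` so that no module path is needed, it must run as root, and the function `ensure_module` with its injectable `read_modules` and `run` parameters is hypothetical.

```python
import subprocess
from pathlib import Path


def ensure_module(name, read_modules=None, run=None):
    """Load `name` with modprobe if /proc/modules does not list it.

    Returns "already-loaded", "loaded", or "failed".
    `read_modules` and `run` are injectable so the logic can be
    exercised without root or real hardware.
    """
    if read_modules is None:
        read_modules = lambda: Path("/proc/modules").read_text()
    if run is None:
        # Assumption: modprobe is on PATH and the caller has root rights.
        run = lambda cmd: subprocess.run(cmd, check=False).returncode

    def present():
        for line in read_modules().splitlines():
            fields = line.split()
            if fields and fields[0] == name:
                return True
        return False

    if present():
        return "already-loaded"
    return_code = run(["modprobe", name])
    # Re-check the module list rather than trusting the exit code alone.
    if return_code == 0 and present():
        return "loaded"
    return "failed"
```

Calling `ensure_module("nvgpu")` before starting the GPU application would reproduce the manual workaround; logging the returned status on each boot also records how often the module was missing.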

dmesg.txt (70.4 KB)
lsmod.txt (3.0 KB)


syslog.txt (213.6 KB)

Attached are the syslog, dmesg, and lsmod outputs. Could you please help analyze why the nvgpu module was not loaded during power-on boot?

Please also capture the output of uname -r.

uname -r reports 5.10.120-tegra.

Going back to the original description, can you confirm whether this situation (nvgpu missing from lsmod) happens only occasionally, or on every boot?

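We ran two tests last night. When the boot log did not contain "No NVIDIA GPU available", nvgpu was present in lsmod.
I then rebooted the board. When the boot log contained "NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!" and "No NVIDIA GPU available", nvgpu was not in lsmod.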
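To classify many boots without reading each log by hand, the messages quoted in this thread can be searched for directly. This is a minimal sketch, assuming that the three strings below (copied from the logs above) are reliable indicators of a failed boot; the names `FAILURE_MARKERS` and `find_gpu_failures` are hypothetical.

```python
# Substrings taken verbatim from the syslog excerpts in this thread.
FAILURE_MARKERS = (
    "Failed to initialize the NVIDIA graphics device",
    "No NVIDIA GPU available",
    "Error opening /sys/devices/gpu.0/tpc_pg_mask",
)


def find_gpu_failures(log_lines):
    """Return (marker, line) pairs for every line containing a known marker."""
    hits = []
    for line in log_lines:
        for marker in FAILURE_MARKERS:
            if marker in line:
                hits.append((marker, line.rstrip("\n")))
                break  # one marker per line is enough
    return hits


def boot_looks_failed(log_lines):
    """True if at least one failure marker appears in the log."""
    return bool(find_gpu_failures(log_lines))
```

For example, `boot_looks_failed(open("/var/log/syslog", errors="replace"))` could be compared against the lsmod result from the same boot to confirm the correlation described above.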

Hello, is there any progress on this issue? Do we need to provide any other information?

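Basically, the cause has been found: the nvgpu driver fails to probe at the first opportunity during boot.

This release is quite old, and no other users have reported a similar issue. I can only suggest that you first check on your side why nvgpu sometimes cannot be loaded by modprobe.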
