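Hello NVIDIA team,
One of the projects we delivered uses six Orin NX 16GB modules on a custom carrier board, running JetPack 5.1.2. We recently found that on one of the boards, after power-on, our application frequently cannot use the GPU; this happens on more than 50% of boots. The same application uses the GPU normally on the other five boards. Attached is the boot log from the affected board, in which we found the following messages: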
Mar 11 19:40:47 node-101-13 nvpmodel.sh[1198]: NVPM ERROR: Error opening /sys/devices/gpu.0/tpc_pg_mask: 2
Mar 11 19:40:47 node-101-13 nvpmodel.sh[1198]: NVPM ERROR: optMask is 2, no request for power mode
Mar 11 19:40:47 node-101-13 systemd[1]: nvpmodel.service: Main process exited, code=exited, status=234/n/a
Mar 11 19:40:47 node-101-13 systemd[1]: nvpmodel.service: Failed with result 'exit-code'.
Mar 11 19:40:47 node-101-13 systemd[1]: Failed to start nvpmodel service.
Mar 11 19:41:43 node-101-13 /usr/lib/gdm3/gdm-x-session[11419]: (II) NVIDIA GLX Module 35.4.1 Release Build (bugfix_main) (buildbrain@mobile-u64-6422-d7000) Tue Aug 1 12:38:45 PDT 2023
Mar 11 19:41:43 node-101-13 /usr/lib/gdm3/gdm-x-session[11419]: (II) NVIDIA: The X server supports PRIME Render Offload.
Mar 11 19:41:43 node-101-13 /usr/lib/gdm3/gdm-x-session[11419]: (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
Mar 11 19:41:43 node-101-13 /usr/lib/gdm3/gdm-x-session[11419]: (EE) NVIDIA(0): Failing initialization of X screen
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.config - Load config from /usr/local/jtop/config.json
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.service - jetson_stats 4.2.6 - server loaded
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.service - Running on Python: 3.8.10
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.hardware - Hardware detected aarch64
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.hardware - NVIDIA Jetson 699-level Part Number=699-13767-0000-301 G.1
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.hardware - NVIDIA Jetson Module=NVIDIA Jetson Orin NX (16GB ram)
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.hardware - NVIDIA Jetson detected L4T=35.4.1
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.cpu - Found 8 CPU
Mar 12 08:15:55 node-101-13 jtop[22261]: [WARNING] jtop.core.gpu - No NVIDIA GPU available
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.processes - Process service started
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.memory - Found EMC!
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.memory - Memory service started
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.engine - Special Engine group found: [dlaX]
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.engine - Special Engine group found: [pvaX]
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.engine - Engines found: [APE DLA0 DLA1 NVDEC NVENC NVJPG OFA PVA0 SE VIC]
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "CV0" in thermal_zone2
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "CPU" in thermal_zone0
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "SOC2" in thermal_zone7
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "SOC0" in thermal_zone5
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "CV1" in thermal_zone3
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "GPU" in thermal_zone1
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "tj" in thermal_zone8
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "SOC1" in thermal_zone6
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.temperature - Found thermal "CV2" in thermal_zone4
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.power - Alarms VDD_IN - {'crit_alarm': 0, 'max_alarm': 0}
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.power - Alarms VDD_CPU_GPU_CV - {'crit_alarm': 0, 'max_alarm': 0}
Mar 12 08:15:55 node-101-13 jtop[22261]: [INFO] jtop.core.power - Alarms VDD_SOC - {'crit_alarm': 0, 'max_alarm': 0}
We are not sure whether the GPU failed to load during system boot. Could you please help analyze this?
0312-syslog.txt (229.8 KB)
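Please share the lsmod output after boot.
OK, I will contact the site and add this information once we have it. Besides the kernel module information, do you need anything else?
Basically, what we see so far is that no log output from the "nvgpu" driver appears in your logs.
If it is also missing from lsmod, please insmod it manually and see whether you hit any errors.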
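The lsmod check suggested above can also be scripted so it runs on every boot. A minimal sketch, assuming the standard Linux `/proc/modules` format (the file `lsmod` reads), where the module name is the first whitespace-separated field on each line; the helper names `loaded_modules` and `is_module_loaded` are hypothetical:

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # first field is the module name
    return names


def is_module_loaded(name, proc_modules_text):
    """True if `name` appears as a loaded module in the given text."""
    return name in loaded_modules(proc_modules_text)


if __name__ == "__main__":
    try:
        text = Path("/proc/modules").read_text()
    except OSError:
        text = ""  # not on Linux, or /proc is unavailable
    print("nvgpu loaded:", is_module_loaded("nvgpu", text))
```

Matching on the exact first field avoids false positives from other modules that merely list nvgpu as a dependency.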
Hello! Here is the information we collected:
1. After power-on, the following messages appear in the log:
Mar 31 20:08:59 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) Module glxserver_nvidia: vendor="NVIDIA Corporation"
Mar 31 20:08:59 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: #011compiled for 1.6.99.901, module version = 1.0.0
Mar 31 20:08:59 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: #011Module class: X.Org Server Extension
Mar 31 20:08:59 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) NVIDIA GLX Module 35.4.1 Release Build (bugfix_main) (buildbrain@mobile-u64-6422-d7000) Tue Aug 1 12:38:45 PDT 2023
Mar 31 20:09:00 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) NVIDIA: The X server supports PRIME Render Offload.
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE) NVIDIA(0): Failing initialization of X screen
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) UnloadModule: "nvidia"
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) UnloadSubModule: "glxserver_nvidia"
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) Unloading glxserver_nvidia
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) UnloadSubModule: "wfb"
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (II) UnloadSubModule: "fb"
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE) Screen(s) found, but none have a usable configuration.
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE)
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: Fatal server error:
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE) no screens found(EE)
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE)
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: Please consult the The X.Org Foundation support
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: #011 at http://wiki.x.org
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: for help.
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE)
Mar 31 20:09:01 node-101-13 /usr/lib/gdm3/gdm-x-session[9225]: (EE) Server terminated with error (1). Closing log file.
Mar 31 20:11:44 node-101-13 jtop[29556]: [INFO] jtop.core.config - Load config from /usr/local/jtop/config.json
Mar 31 20:11:44 node-101-13 jtop[29556]: [INFO] jtop.service - jetson_stats 4.2.6 - server loaded
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.service - Running on Python: 3.8.10
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.hardware - Hardware detected aarch64
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.hardware - NVIDIA Jetson 699-level Part Number=699-13767-0000-301 G.1
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.hardware - NVIDIA Jetson Module=NVIDIA Jetson Orin NX (16GB ram)
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.hardware - NVIDIA Jetson detected L4T=35.4.1
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.cpu - Found 8 CPU
Mar 31 20:11:45 node-101-13 jtop[29556]: [WARNING] jtop.core.gpu - No NVIDIA GPU available
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.processes - Process service started
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.memory - Found EMC!
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.memory - Memory service started
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.engine - Special Engine group found: [dlaX]
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.engine - Special Engine group found: [pvaX]
Mar 31 20:11:45 node-101-13 jtop[29556]: [INFO] jtop.core.engine - Engines found: [APE DLA0 DLA1 NVDEC NVENC NVJPG OFA PVA0 SE VIC]
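2. lsmod shows that nvgpu is not among the loaded modules.
3. Loading nvgpu manually with insmod works normally, with no error messages.
4. After loading nvgpu manually, our application can use the GPU normally.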
dmesg.txt (70.4 KB)
lsmod.txt (3.0 KB)
syslog.txt (213.6 KB)
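Attached are the syslog, dmesg, and lsmod outputs. Could you please help analyze the specific reason why the nvgpu module is not loaded during boot?
Going back to your original description, could you confirm whether this situation (nvgpu missing from lsmod) happens only occasionally, or after every boot?
Last night we ran two tests. When "No NVIDIA GPU available" did not appear in the boot log, nvgpu was present in lsmod.
I then rebooted the board. When the boot log showed "NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!" and "No NVIDIA GPU available", nvgpu was absent from lsmod.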
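To correlate these two observations over many reboots, the syslog can be scanned for the failure messages quoted in this thread. A minimal sketch; the helper name `gpu_failure_lines` is hypothetical, and the signature strings are copied from the logs above rather than from any official list:

```python
# Substrings copied from the log messages quoted in this thread.
GPU_FAILURE_SIGNATURES = (
    "Error opening /sys/devices/gpu.0/tpc_pg_mask",
    "Failed to initialize the NVIDIA graphics device",
    "No NVIDIA GPU available",
)


def gpu_failure_lines(lines):
    """Return the log lines that contain any known GPU failure signature."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(sig in line for sig in GPU_FAILURE_SIGNATURES)
    ]
```

It accepts any iterable of lines, so an open file object for the syslog can be passed directly. An empty result on a boot where nvgpu is present in lsmod, and a non-empty one where it is absent, would match what we saw in the two tests.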
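Hello, is there any progress on this issue? Do we need to provide any other information?
Basically, the cause has been found: the nvgpu driver fails to probe right away at boot.
This release is already quite old, and no other users have reported a similar issue. I can only suggest that you first check on your side why nvgpu sometimes fails to modprobe.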
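One way to start that check is to attempt the load from a script and record what `modprobe` reports. A minimal sketch, assuming it runs as root on the target; the function name `try_modprobe` and its injectable `run` parameter are hypothetical, and this only captures the outcome, it does not explain the root cause:

```python
import subprocess


def try_modprobe(module, run=subprocess.run):
    """Run `modprobe <module>` and return (exit code, stderr text)."""
    result = run(
        ["modprobe", module],
        capture_output=True,
        text=True,
        check=False,  # report failures instead of raising
    )
    return result.returncode, result.stderr.strip()
```

Calling `try_modprobe("nvgpu")` on a boot where the module is missing gives an exit code and any error text, which can be saved together with dmesg from the same boot.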