zhahir
March 20, 2025, 9:29am
1
NVIDIA kernel error message at boot.
System:
Gentoo Linux (Clang profile), running as a VM
Compiler: Clang 19
Kernel: 6.12.16 (configured per the Gentoo Wiki page for the NVIDIA kernel module)
NVIDIA Open Kernel 570.124.06
GPU model: H100 PCIe
VMware ESXi v8.0.2
Problems
The NVIDIA modules build successfully, but the system crashes when I run nvidia-smi.
nvidia-persistenced is killed immediately after it starts, while the modprobe process sits at 100% CPU and cannot be killed, even with SIGKILL.
dmesg output:
[ 9.271034] nvidia: loading out-of-tree module taints kernel.
[ 9.352730] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 9.353009] #PF: supervisor instruction fetch in kernel mode
[ 9.353199] #PF: error_code(0x0010) - not-present page
[ 9.353380] PGD 0 P4D 0
[ 9.353564] Oops: Oops: 0010 [#1] PREEMPT SMP NOPTI
[ 9.353745] CPU: 4 UID: 0 PID: 1294 Comm: (udev-worker) Tainted: G O 6.12.16-gentoo #9
[ 9.353927] Tainted: [O]=OOT_MODULE
[ 9.354099] Hardware name: VMware, Inc. VMware20,1/440BX Desktop Reference Platform , BIOS VMW201.00V.21805430.B64.2305221830 05/22/2023
[ 9.354282] RIP: 0010:0x8
[ 9.354464] Code: Unable to access opcode bytes at 0xffffffffffffffde.
# (trace trimmed; it ends with the line below)
[ 9.370731] note: (udev-worker)[1294] exited with irqs disabled
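For anyone reading the trace above: the error_code(0x0010) value packs several flags into one number. A minimal sketch that unpacks the standard x86 page-fault error-code bits follows. The helper name decode_pf_error is hypothetical (it is not a kernel or NVIDIA tool), and only the bits relevant to this trace are handled.

```shell
#!/bin/sh
# Decode the x86 page-fault error code printed in a "#PF: error_code(...)" line.
# decode_pf_error is a hypothetical helper name used only for illustration.
decode_pf_error() {
    code=$(( $1 ))   # accepts decimal or hex input such as 0x0010

    # Bit 0 (P): 0 = page not present, 1 = protection violation.
    if [ $(( code & 0x01 )) -ne 0 ]; then
        echo "protection violation"
    else
        echo "not-present page"
    fi

    # Bit 4 (I/D) marks an instruction fetch; otherwise bit 1 (W) picks write vs read.
    if [ $(( code & 0x10 )) -ne 0 ]; then
        echo "instruction fetch"
    elif [ $(( code & 0x02 )) -ne 0 ]; then
        echo "write access"
    else
        echo "read access"
    fi

    # Bit 2 (U): 0 = fault raised in supervisor mode, 1 = in user mode.
    if [ $(( code & 0x04 )) -ne 0 ]; then
        echo "user mode"
    else
        echo "supervisor mode"
    fi
}

decode_pf_error 0x0010
```

An instruction fetch from a not-present page, with both the fault address and RIP at 0x8, usually means the kernel jumped through an invalid function pointer, which would point at the freshly loaded module rather than at user space.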
Is this error caused by the hypervisor? Or
is the GPU stuck in a 'suspend' state and unable to resume? Or
is it a mismatch between the compiler versions used for the kernel and the modules? Or is it something else I missed?
Please help me figure this out.
Thanks
I got a similar kernel oops with the 570 driver on my Gentoo desktop. However, I use the proprietary kernel modules:
This one seems to be the same problem:
I will try 570.133.07 this weekend.
zhahir
March 25, 2025, 10:21am
3
I need to fix this linker problem (Clang with LLD):
cat /proc/driver/nvidia/version -v
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 570.124.06 Release Build (root@gentoo92) Tue Mar 25 06:13:44 PM -00 2025
GCC version: clang: error: linker command failed with exit code 1 (use -v to see invocation)
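The "GCC version:" line normally records the banner of the compiler that built the module, so it can be compared with the compiler recorded in /proc/version for the kernel. A rough sketch of that comparison follows. The helper toolchain_of is hypothetical, and the string patterns it matches are assumptions about how Clang and GCC banners usually look, not a documented format.

```shell
#!/bin/sh
# Classify a compiler banner string as clang, gcc, broken, or unknown.
# toolchain_of is a hypothetical helper, not part of any NVIDIA or kernel tooling.
toolchain_of() {
    case "$1" in
        *error:*)          echo "broken" ;;   # banner holds an error message, as in the output above
        *"clang version"*) echo "clang" ;;
        *gcc*|*GCC*)       echo "gcc" ;;
        *)                 echo "unknown" ;;
    esac
}

# Compiler that built the running kernel.
if [ -r /proc/version ]; then
    echo "kernel: $(toolchain_of "$(cat /proc/version)")"
fi

# Compiler banner recorded by the loaded NVIDIA module, with the label stripped
# so the literal text "GCC version:" is not mistaken for a GCC banner.
if [ -r /proc/driver/nvidia/version ]; then
    banner=$(sed -n 's/^GCC version:[[:space:]]*//p' /proc/driver/nvidia/version)
    echo "nvidia module: $(toolchain_of "$banner")"
fi
```

If the two lines disagree, or the module line reports "broken", the module was probably not built with the same toolchain as the kernel.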
SOLVED
I needed to use the GRID drivers and the correct compiler flags.
Use the vGPU driver from the NVIDIA GRID drivers listed at https://cloud.google.com/compute/docs/gpus/grid-drivers-table
For Clang, export the necessary compiler flags.
Compile CUDA Toolkit 12.4 using GCC 13 (the Gentoo way).
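The post does not list which compiler flags were exported for Clang. As a hedged sketch of one way to pick them, the script below reads the kernel's own build configuration and prints a matching Kbuild variable. CONFIG_CC_IS_CLANG, CONFIG_LD_IS_LLD and LLVM=1 come from the upstream kernel build system. The helper name kbuild_toolchain_vars and the config path are assumptions, and whether the NVIDIA module build or the Gentoo ebuild honors these variables is not confirmed by this thread.

```shell
#!/bin/sh
# Suggest a Kbuild toolchain variable that matches how the kernel itself was built.
# kbuild_toolchain_vars is a hypothetical helper; it takes the path to a kernel .config.
kbuild_toolchain_vars() {
    cfg=$1
    if grep -q '^CONFIG_CC_IS_CLANG=y' "$cfg"; then
        if grep -q '^CONFIG_LD_IS_LLD=y' "$cfg"; then
            echo "LLVM=1"      # Clang plus LLD: use the full LLVM toolchain
        else
            echo "CC=clang"    # Clang with a non-LLD linker
        fi
    else
        echo "CC=gcc"
    fi
}

# Assumed location of the config for the running kernel's build tree.
cfg="/lib/modules/$(uname -r)/build/.config"
if [ -r "$cfg" ]; then
    echo "suggested module build variable: $(kbuild_toolchain_vars "$cfg")"
fi
```

The printed value would then be exported, or passed on the make command line, before rebuilding the module.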
#: nvidia-smi
Thu May 8 13:53:08 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100-4C                 Off |   00000000:03:00.0 Off |                    0 |
| N/A   N/A    P0              N/A /  N/A |       1MiB /   4096MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
CUDA Toolkit
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0