Hi,
I recently started configuring AMD SEV-SNP with an H100 GPU and tried to run some small demos on my machine. Everything went smoothly except for the ‘Installing the NVIDIA Driver and CUDA Toolkit’ step.
My machine’s specs:
CPU: Dual AMD EPYC 9224
GPU: H100 10de:2331
RAM: 256G
SSD: 2T
Host OS: Ubuntu 22.04 with 5.19.0-rc6-snp-host-c4daeffce56e kernel
Guest OS: Ubuntu 22.04 with 6.2.0-39-generic kernel
I have completed the steps up to p. 25 and the ‘Enabling LKCA on the Guest VM’ part of the ‘Confidential Computing Deployment Guide’.
Next, for ‘Installing the NVIDIA Driver and CUDA Toolkit’, I ran
$ sudo sh cuda_12.2.1_535.86.10_linux.run -m=kernel-open
on the guest VM.
The installation failed.
This is what I found in /var/log/nvidia-installer.log:
(full log can be found here)
Skipping BTF generation for /tmp/selfgz1940/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia-modeset.ko due to unavailability of vmlinux
BTF [M] /tmp/selfgz1940/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia-drm.ko
Skipping BTF generation for /tmp/selfgz1940/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia-drm.ko due to unavailability of vmlinux
BTF [M] /tmp/selfgz1940/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia.ko
Skipping BTF generation for /tmp/selfgz1940/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia.ko due to unavailability of vmlinux
BTF [M] /tmp/selfgz1940/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia-uvm.ko
Skipping BTF generation for /tmp/selfgz1940/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia-uvm.ko due to unavailability of vmlinux
make[1]: Leaving directory '/usr/src/linux-headers-6.2.0-39-generic'
-> done.
-> Kernel module compilation complete.
-> Unable to determine if Secure Boot is enabled: No such file or directory
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such device
-> Kernel messages:
[ 9.407689] audit: type=1400 audit(1703739996.088:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=616 comm="apparmor_parser"
[ 9.407693] audit: type=1400 audit(1703739996.088:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=616 comm="apparmor_parser"
[ 9.408083] audit: type=1400 audit(1703739996.088:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="tcpdump" pid=614 comm="apparmor_parser"
[ 9.410233] audit: type=1400 audit(1703739996.092:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=612 comm="apparmor_parser"
[ 9.410238] audit: type=1400 audit(1703739996.092:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=612 comm="apparmor_parser"
[ 9.410242] audit: type=1400 audit(1703739996.092:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=612 comm="apparmor_parser"
[ 12.895640] loop3: detected capacity change from 0 to 8
[ 13.712095] fbcon: Taking over console
[ 13.766104] Console: switching to colour frame buffer device 128x48
[ 2233.703006] VFIO - User Level meta-driver version: 0.3
[ 2233.765797] nvidia: loading out-of-tree module taints kernel.
[ 2233.767753] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2233.831625] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 2233.831637] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:2331)
NVRM: installed in this system is not supported by open
NVRM: nvidia.ko because it does not include the required GPU
NVRM: System Processor (GSP).
NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
NVRM: Firmware' sections in the driver README, available on
NVRM: the Linux graphics driver download page at
NVRM: www.nvidia.com.
[ 2239.271930] nvidia: probe of 0000:01:00.0 failed with error -1
[ 2239.272052] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 2239.272053] NVRM: None of the NVIDIA devices were initialized.
[ 2239.272992] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
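In case it helps anyone reading the log above, here is a minimal sketch for pulling out only the NVRM and probe lines, which are the ones that name the GPU and the reason the module refused it. The function name `nvrm_lines` is made up for illustration, and the default path is simply the one quoted in this post.

```shell
# Print only the NVRM / nvidia probe lines from an installer log.
# Usage: nvrm_lines [logfile]
# nvrm_lines is a hypothetical helper name; the default path is the log
# file mentioned in this post and may differ on other setups.
nvrm_lines() {
    grep -E 'NVRM:|nvidia: probe' "${1:-/var/log/nvidia-installer.log}"
}
```

Running `nvrm_lines` with no argument reads the default log; passing a path reads a saved copy instead.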
When I run $ lspci -k
on the guest VM, it shows:
cclab@guest:~$ lspci -k
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
Subsystem: Red Hat, Inc. QEMU Virtual Machine
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
Subsystem: Red Hat, Inc. Device 1100
Kernel driver in use: bochs-drm
Kernel modules: bochs
00:02.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI (rev 01)
Subsystem: Red Hat, Inc. Virtio SCSI
Kernel driver in use: virtio-pci
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
Subsystem: Red Hat, Inc. Virtio network device
Kernel driver in use: virtio-pci
00:04.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
Kernel driver in use: pcieport
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
Subsystem: Red Hat, Inc. QEMU Virtual Machine
Kernel driver in use: lpc_ich
Kernel modules: lpc_ich
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
Subsystem: Red Hat, Inc. QEMU Virtual Machine
Kernel driver in use: ahci
Kernel modules: ahci
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
Subsystem: Red Hat, Inc. QEMU Virtual Machine
Kernel driver in use: i801_smbus
Kernel modules: i2c_i801
01:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
Subsystem: NVIDIA Corporation Device 1626
Kernel modules: nvidiafb, nouveau
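The listing above shows no "Kernel driver in use" line for 01:00.0, so no driver is bound to the GPU. A rough sketch for checking that from a script follows. It only parses `lspci -k` text read from stdin; the helper name `bound_driver` is hypothetical.

```shell
# Read `lspci -k` output on stdin and print the driver bound to the given
# slot, or "none" when no "Kernel driver in use" line follows that slot.
# bound_driver is a hypothetical helper name used only for illustration.
bound_driver() {
    awk -v slot="$1" '
        # A line that begins with bus:device.function starts a new device entry.
        /^[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-7] / { in_dev = (index($0, slot) == 1) }
        in_dev && /Kernel driver in use:/ {
            sub(/.*Kernel driver in use: */, "")
            print
            found = 1
        }
        END { if (!found) print "none" }
    '
}
```

For example, `lspci -k | bound_driver 01:00.0` reports whether any driver has claimed the GPU after an install attempt.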
How can I fix this?
Also, I did not see anything in that document saying that the guest OS has to be the same version as the host's. Is it okay that they differ?