Installing the NVIDIA Driver and CUDA Toolkit failed

Hi,

I recently started configuring AMD SEV-SNP with an H100 GPU and tried to run some small demos on my machine. Everything went smoothly except for the ‘Installing the NVIDIA Driver and CUDA Toolkit’ step.

My machine’s specs:

CPU: Dual AMD EPYC 9224
GPU: H100 10de:2331
RAM: 256G
SSD: 2T

Host OS: Ubuntu 22.04 with 5.19.0-rc6-snp-host-c4daeffce56e kernel
Guest OS: Ubuntu 22.04 with 6.2.0-39-generic kernel

I have completed the steps up to p.25 and the ‘Enabling LKCA on the Guest VM’ part of the ‘Confidential Computing Deployment Guide’.

For the next step, ‘Installing the NVIDIA Driver and CUDA Toolkit’, I ran $ sudo sh cuda_12.2.1_535.86.10_linux.run -m=kernel-open on the guest VM.

The installer reported that it failed.

When I looked at /var/log/nvidia-installer.log (full log can be found here), I saw:

   Skipping BTF generation for /tmp/selfgz1940/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia-modeset.ko due to unavailability of vmlinux
     BTF [M] /tmp/selfgz1940/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia-drm.ko
   Skipping BTF generation for /tmp/selfgz1940/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia-drm.ko due to unavailability of vmlinux
     BTF [M] /tmp/selfgz1940/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia.ko
   Skipping BTF generation for /tmp/selfgz1940/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia.ko due to unavailability of vmlinux
     BTF [M] /tmp/selfgz1940/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia-uvm.ko
   Skipping BTF generation for /tmp/selfgz1940/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia-uvm.ko due to unavailability of vmlinux
   make[1]: Leaving directory '/usr/src/linux-headers-6.2.0-39-generic'
-> done.
-> Kernel module compilation complete.
-> Unable to determine if Secure Boot is enabled: No such file or directory
ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such device
-> Kernel messages:
[    9.407689] audit: type=1400 audit(1703739996.088:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=616 comm="apparmor_parser"
[    9.407693] audit: type=1400 audit(1703739996.088:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=616 comm="apparmor_parser"
[    9.408083] audit: type=1400 audit(1703739996.088:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="tcpdump" pid=614 comm="apparmor_parser"
[    9.410233] audit: type=1400 audit(1703739996.092:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=612 comm="apparmor_parser"
[    9.410238] audit: type=1400 audit(1703739996.092:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=612 comm="apparmor_parser"
[    9.410242] audit: type=1400 audit(1703739996.092:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=612 comm="apparmor_parser"
[   12.895640] loop3: detected capacity change from 0 to 8
[   13.712095] fbcon: Taking over console
[   13.766104] Console: switching to colour frame buffer device 128x48
[ 2233.703006] VFIO - User Level meta-driver version: 0.3
[ 2233.765797] nvidia: loading out-of-tree module taints kernel.
[ 2233.767753] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2233.831625] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 2233.831637] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:2331)
               NVRM: installed in this system is not supported by open
               NVRM: nvidia.ko because it does not include the required GPU
               NVRM: System Processor (GSP).
               NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
               NVRM: Firmware' sections in the driver README, available on
               NVRM: the Linux graphics driver download page at
               NVRM: www.nvidia.com.
[ 2239.271930] nvidia: probe of 0000:01:00.0 failed with error -1
[ 2239.272052] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 2239.272053] NVRM: None of the NVIDIA devices were initialized.
[ 2239.272992] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
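
Most of the kernel messages above are unrelated AppArmor audit lines, so it can help to filter the log down to the driver's own output. The following is a minimal sketch, not part of the deployment guide: the function name nvrm_lines is made up for illustration, and the default path is simply the one named in the error message.

```shell
#!/bin/sh
# Hypothetical helper: print only the lines in an installer log that come
# from the NVIDIA kernel module (NVRM:, nvidia:, nvidia-nvlink:).
# The default path is the log file named in the installer's error message.
nvrm_lines() {
    log="${1:-/var/log/nvidia-installer.log}"
    grep -E 'NVRM:|nvidia(-nvlink)?:' "$log"
}
```

Calling nvrm_lines with no argument reads the default log. On the log above it should leave mainly the 'not supported by open nvidia.ko' block and the probe failure lines.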

Running $ lspci -k on the guest VM gives:

cclab@guest:~$ lspci -k
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
	Subsystem: Red Hat, Inc. QEMU Virtual Machine
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
	Subsystem: Red Hat, Inc. Device 1100
	Kernel driver in use: bochs-drm
	Kernel modules: bochs
00:02.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI (rev 01)
	Subsystem: Red Hat, Inc. Virtio SCSI
	Kernel driver in use: virtio-pci
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
	Subsystem: Red Hat, Inc. Virtio network device
	Kernel driver in use: virtio-pci
00:04.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
	Kernel driver in use: pcieport
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
	Subsystem: Red Hat, Inc. QEMU Virtual Machine
	Kernel driver in use: lpc_ich
	Kernel modules: lpc_ich
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
	Subsystem: Red Hat, Inc. QEMU Virtual Machine
	Kernel driver in use: ahci
	Kernel modules: ahci
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
	Subsystem: Red Hat, Inc. QEMU Virtual Machine
	Kernel driver in use: i801_smbus
	Kernel modules: i2c_i801
01:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
	Subsystem: NVIDIA Corporation Device 1626
	Kernel modules: nvidiafb, nouveau
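
In the listing above, 01:00.0 has a 'Kernel modules' line but no 'Kernel driver in use' line, which is consistent with nvidia.ko failing to bind to the GPU. As a rough sketch for checking this after each retry, the helper below reads lspci -k output on stdin and prints the bound driver for one slot. The name driver_in_use is an assumption made for illustration, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -k` output on stdin and print the
# "Kernel driver in use" value for the given slot (e.g. 01:00.0).
# Prints nothing if no driver is bound to that device.
driver_in_use() {
    slot="$1"
    awk -v slot="$slot" '
        # Device header lines start with the slot address in column 1.
        /^[0-9a-f]/ { current = ($1 == slot) }
        # Indented detail lines belong to the most recent header.
        current && /Kernel driver in use:/ { print $NF }
    '
}
```

Usage would be something like lspci -k | driver_in_use 01:00.0. Empty output means nothing is bound yet.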

How can I fix this?

Also, I did not see anything in that document saying that the guest OS kernel has to be the same version as the host's. Is the mismatch okay?

Near the top of your install log I see:

An alternate method of installing the NVIDIA driver was detected. (This is usually a package provided by your distributor.) A driver installed via that method may integrate better with your system than a driver installed by nvidia-installer.

Can you try running sudo apt-get purge nvidia* on the host and then retry the runfile?

I ran sudo apt-get purge nvidia* on the host.

But the installation failed on the guest VM again…