Hello everyone,
We are seeing a problem with a single H100 NVL GPU when confidential computing (CC) is enabled. We followed every step in this deployment guide for AMD SEV, yet we consistently get error 802 (cudaErrorSystemNotReady), even though all system components appear to be working correctly.
We initially ran Arch Linux and also tried the 550 driver, but the error persists in every case.
Below are the relevant logs. Please let me know if you need any additional information; I'm happy to provide more details:
lukas@ubuntu-sev:~$ sudo dmesg | grep -i nvidia
[ 6.138339] nvidia: loading out-of-tree module taints kernel.
[ 6.138351] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 6.207574] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[ 6.208920] nvidia 0000:01:00.0: enabling device (0000 -> 0002)
[ 6.230239] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 570.172.08 Release Build (dvs-builder@U22-I3-AF01-21-3) Tue Jul 8 18:08:21 UTC 2025
[ 6.295743] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 570.172.08 Release Build (dvs-builder@U22-I3-AF01-21-3) Tue Jul 8 17:59:47 UTC 2025
[ 6.306143] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 10.268110] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[ 10.268131] nvidia 0000:01:00.0: [drm] No compatible format found
[ 10.268135] nvidia 0000:01:00.0: [drm] Cannot find any crtc or sizes
lukas@ubuntu-sev:~$ sudo dmesg | grep -i sev
[ 1.295817] Memory Encryption Features active: AMD SEV SEV-ES SEV-SNP
[ 1.413596] SEV: APIC: wakeup_secondary_cpu() replaced with wakeup_cpu_via_vmgexit()
[ 1.533778] SEV: Using SNP CPUID table, 29 entries present.
[ 1.924609] SEV: SNP guest platform device initialized.
[ 5.080549] systemd[1]: Hostname set to <ubuntu-sev>.
[ 5.937308] sev-guest sev-guest: Initialized SEV guest driver (using vmpck_id 0)
[ 6.101569] kvm_amd: KVM is unsupported when running as an SEV guest
lukas@ubuntu-sev:~$ nvidia-smi
Tue Aug 5 07:17:55 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 NVL                On  |   00000000:01:00.0 Off |                    0 |
| N/A   48C    P0             63W /  350W |      23MiB /  95830MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
lukas@ubuntu-sev:~$ nvidia-smi conf-compute -f
CC status: ON
lukas@ubuntu-sev:~$ ps -aux | grep nvidia-persistenced
root 859 3.6 0.0 6908 2444 ? Ss 07:17 0:02 /usr/bin/nvidia-persistenced --verbose --uvm-persistence-mode
lukas 1229 0.0 0.0 6680 2304 pts/0 S+ 07:18 0:00 grep --color=auto nvidia-persistenced
lukas@ubuntu-sev:~$ ./cuda-samples/build/Samples/0_Introduction/matrixMul/matrixMul
[Matrix Multiply Using CUDA] - Starting...
CUDA error at /home/lukas/cuda-samples/Samples/0_Introduction/matrixMul/../../../Common/helper_cuda.h:807 code=802(cudaErrorSystemNotReady) "cudaGetDeviceCount(&device_count)"
The host runs Ubuntu 25.04, as in the guide; the VM runs Ubuntu 24.04.2 LTS. With CC disabled, CUDA works in the VM.
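In case a reproduction that does not depend on cuda-samples is useful, the matrixMul failure above could probably be narrowed down with a sketch like the following. It assumes the driver library is loadable as libcuda.so.1 and calls the driver API's cuInit directly through ctypes; probe_cuda_driver is a hypothetical helper name, and this is an illustrative sketch rather than output from our system.

```python
import ctypes

CUDA_SUCCESS = 0
CUDA_ERROR_SYSTEM_NOT_READY = 802  # same numeric value as cudaErrorSystemNotReady


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit(0) directly; return (result_code, device_count) or None.

    probe_cuda_driver is a hypothetical helper; the library name is an assumption.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None  # driver library could not be loaded on this machine

    rc = libcuda.cuInit(0)
    count = ctypes.c_int(0)
    if rc == CUDA_SUCCESS:
        # Only query the device count once initialization has succeeded.
        libcuda.cuDeviceGetCount(ctypes.byref(count))
    return rc, count.value


if __name__ == "__main__":
    result = probe_cuda_driver()
    if result is None:
        print("libcuda.so.1 could not be loaded")
    else:
        rc, count = result
        note = " (CUDA_ERROR_SYSTEM_NOT_READY)" if rc == CUDA_ERROR_SYSTEM_NOT_READY else ""
        print(f"cuInit returned {rc}{note}, device count = {count}")
```

If cuInit itself returns 802 here, that would suggest the failure sits in driver initialization rather than in the sample application or the CUDA runtime library.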
Thank you