Verifying Kata Manager, Confidential Computing Manager, and VFIO Manager FAILED

My machine’s specs:

CPU: Dual AMD EPYC 9224 16-Core Processor
GPU: H100 (10de:2331), VBIOS 96.00.5E.00.03, CUDA 12.2, NVIDIA driver 535.86.10
Host OS: Ubuntu 22.04 with 5.19.0-rc6-snp-host-c4daeffce56e kernel
Guest OS: Ubuntu 22.04.2 with 5.19.0-rc6-snp-guest-c4daeffce56e kernel
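For completeness, here is a rough way to sanity-check that SEV-SNP is actually active on both sides (the exact dmesg messages and module-parameter paths depend on the SNP kernel build, so treat this as a sketch):

(host) $ sudo dmesg | grep -i -e sev -e snp
(host) $ cat /sys/module/kvm_amd/parameters/sev_snp
(guest) $ sudo dmesg | grep -i -e sev -e snp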

Following the deployment document, I succeeded up to p. 39.
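For reference, the operator install step on p. 40 boils down to a helm command roughly like the one below (chart values as documented for sandbox/confidential workloads; my exact invocation followed the deployment document, and v23.9.1 matches the validator image version shown later):

cclab@guest:~$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version v23.9.1 \
    --set sandboxWorkloads.enabled=true \
    --set kataManager.enabled=true \
    --set ccManager.enabled=true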

But when I ran the NVIDIA GPU Operator install (step 3, "Install Operator," on p. 40), I hit the following errors.

cclab@guest:~$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS                  RESTARTS        AGE
gpu-operator-1704448302-node-feature-discovery-gc-5785d845tnkb9   1/1     Running                 0               36m
gpu-operator-1704448302-node-feature-discovery-master-7464275cx   1/1     Running                 0               36m
gpu-operator-1704448302-node-feature-discovery-worker-bdbkv       1/1     Running                 0               36m
gpu-operator-d7467c67f-bfrxd                                      1/1     Running                 0               36m
nvidia-kata-manager-x74m4                                         0/1     Running                 0               36m
nvidia-sandbox-device-plugin-daemonset-wcj79                      0/1     Init:0/2                0               21m
nvidia-sandbox-validator-n9rcj                                    0/1     Init:CrashLoopBackOff   9 (96s ago)     22m
nvidia-vfio-manager-2mtfv                                         0/1     Init:CrashLoopBackOff   7 (3m44s ago)   22m

It seems the nvidia-* pods never become ready in the cluster.

I also checked the events for the failing validator pod:

kubectl describe pod nvidia-sandbox-validator-n9rcj -n gpu-operator

It shows:

  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  23m                   default-scheduler  Successfully assigned gpu-operator/nvidia-sandbox-validator-n9rcj to guest
  Normal   Pulled     23m                   kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
  Normal   Created    23m                   kubelet            Created container cc-manager-validation
  Normal   Started    23m                   kubelet            Started container cc-manager-validation
  Normal   Pulled     22m (x5 over 23m)     kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
  Normal   Created    22m (x5 over 23m)     kubelet            Created container vfio-pci-validation
  Normal   Started    22m (x5 over 23m)     kubelet            Started container vfio-pci-validation
  Warning  BackOff    3m36s (x94 over 23m)  kubelet            Back-off restarting failed container vfio-pci-validation in pod nvidia-sandbox-validator-n9rcj_gpu-operator(12ecb940-e6ae-4f5a-9235-6cb0afdfdd5d)
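For more detail than the events give, the failing init container's logs can be pulled like this (pod and container names taken from the output above); I can paste that output if it helps:

cclab@guest:~$ kubectl logs -n gpu-operator nvidia-sandbox-validator-n9rcj -c vfio-pci-validation
cclab@guest:~$ kubectl logs -n gpu-operator nvidia-sandbox-validator-n9rcj -c vfio-pci-validation --previous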

On the host:

(host) $ lspci -nnk -d 10de:
44:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:2331] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:1626]
	Kernel driver in use: vfio-pci
	Kernel modules: nvidiafb, nouveau
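For context, the GPU was bound to vfio-pci on the host before launching the CVM. The usual sysfs sequence for that looks roughly like the following (run as root; 0000:44:00.0 is my device's address; skip the unbind step if no driver is currently attached):

(host) # modprobe vfio-pci
(host) # echo vfio-pci > /sys/bus/pci/devices/0000:44:00.0/driver_override
(host) # echo 0000:44:00.0 > /sys/bus/pci/devices/0000:44:00.0/driver/unbind
(host) # echo 0000:44:00.0 > /sys/bus/pci/drivers/vfio-pci/bind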

How can I fix this?