We are configuring virtualization software for a customer's HGX B200 8-GPU system. Our stack uses VFIO / KVM / QEMU for direct GPU passthrough. We are running into the following issue on this system:
- The VM is created, and the GPU is successfully passed through and visible with lspci.
- nvidia-smi shows Driver Version: 570.172.08 for the B200. Persistence mode is “On”.
- deviceQuery reports CUDA Error: initialization error (code 3).
- We consistently see the following errors in dmesg:
NVRM: kbifCacheVFInfo_GB100: Unable to read NV_PF0_INITIAL_AND_TOTAL_VFS
NVRM: calculatePCIELinkRateMBps: Unknown PCIe speed
NVRM: getPCIELinkRateMBps: Generic Error: Invalid state [NV_ERR_INVALID_STATE]
[drm] [nvidia-drm] [GPU ID 0x00000010] Failed to allocate NvKmsKapiDevice
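
To help narrow down the "Unknown PCIe speed" message, a minimal sketch like the one below could be run inside the guest to print the PCIe link speed and width that sysfs reports for each NVIDIA device. It assumes a Linux guest that exposes the standard `current_link_speed`, `current_link_width`, `max_link_speed`, and `max_link_width` attributes under `/sys/bus/pci/devices`. The function name `nvidia_link_status` and its `sysfs_root` parameter are made up for this illustration. It only reports what the guest sees and is not a confirmed diagnosis of the errors above.

```python
from pathlib import Path
from typing import Dict, List

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA

# Standard Linux sysfs PCI attributes describing the PCIe link.
LINK_ATTRS = ("current_link_speed", "current_link_width",
              "max_link_speed", "max_link_width")


def read_attr(dev: Path, name: str) -> str:
    """Return a sysfs attribute as text, or 'unavailable' if it cannot be read."""
    try:
        return (dev / name).read_text().strip()
    except OSError:
        return "unavailable"


def nvidia_link_status(sysfs_root: str = "/sys/bus/pci/devices") -> List[Dict[str, str]]:
    """Collect link attributes for every PCI device whose vendor is NVIDIA.

    Hypothetical helper for illustration; sysfs_root is a parameter only so
    the function can be pointed at a different directory.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    results = []
    for dev in sorted(root.iterdir()):
        if read_attr(dev, "vendor").lower() != NVIDIA_VENDOR_ID:
            continue
        entry = {"address": dev.name}
        for attr in LINK_ATTRS:
            entry[attr] = read_attr(dev, attr)
        results.append(entry)
    return results


if __name__ == "__main__":
    for entry in nvidia_link_status():
        print(entry["address"],
              "current:", entry["current_link_speed"], "x" + entry["current_link_width"],
              "max:", entry["max_link_speed"], "x" + entry["max_link_width"])
```

If the guest reports an unknown or unavailable link speed for the GPU, that would at least be consistent with the `calculatePCIELinkRateMBps` message, and the same check on the host would show whether the values differ.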
Is this a known issue? What would be the fastest way to resolve it, ideally with as few changes to our software as possible?