Error running CUDA on a VM with GPU passthrough: torch.cuda.get_device_name() fails with Error 802 (system not yet initialized)

We have a hardware setup with multiple H100s (as listed by lspci -d 10de:). I have set up one GPU for passthrough to my QEMU/KVM VM. After installing the drivers in the guest, nvidia-smi shows the single GPU I attached:

name, pci.bus_id, vbios_version, driver_version
NVIDIA H100 80GB HBM3, 00000000:01:00.0, 96.00.61.00.01, 550.54.15

CUDA Version is 12.4, Driver Version: 550.54.15
MIG is disabled, Persistence-M is Off.
Guest is on Ubuntu 22.04.
But I still get errors when trying to run the CUDA samples, either directly or via PyTorch.

>>> import os,torch
>>> torch.cuda.is_available()
.../python3.10/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False

Setting the following environment variables makes is_available() return True, but the next call still fails:

>>> os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
>>> os.environ["CUDA_VISIBLE_DEVICES"]="0"
>>> os.environ["PYTORCH_NVML_BASED_CUDA_CHECK"]="1"
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
...
  File ".../site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
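
For anyone comparing logs: in the CUDA runtime API, code 802 is cudaErrorSystemNotReady, which the documentation describes as the system not being ready to start CUDA work, for example because a required driver daemon is not running. The sketch below only pulls the code and text out of a message like the one above. The helper name parse_cuda_error and the KNOWN_CODES table are made up for illustration and are not part of PyTorch or CUDA.

```python
import re

# Matches the "Error <code>: <text>" fragment in CUDA runtime messages.
_CUDA_ERROR_RE = re.compile(r"Error (\d+): ([^(\n]+)")

# Hypothetical lookup table holding only the code seen in this thread.
KNOWN_CODES = {802: "cudaErrorSystemNotReady"}


def parse_cuda_error(message):
    """Return (code, symbolic_name, text) from a CUDA error message, or None."""
    match = _CUDA_ERROR_RE.search(message)
    if match is None:
        return None
    code = int(match.group(1))
    return code, KNOWN_CODES.get(code, "unknown"), match.group(2).strip()


msg = (
    "Unexpected error from cudaGetDeviceCount(). Did you run some cuda "
    "functions before calling NumCudaDevices() that might have already set "
    "an error? Error 802: system not yet initialized"
)
print(parse_cuda_error(msg))
```

This does not fix anything by itself. It just makes it easier to tell this failure apart from other CUDA initialization errors when scanning logs.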

Thank You

I’m having the same issue, very similar setup:
VM OS: Ubuntu 22.04
NVIDIA-SMI 550.90.07
Driver Version: 550.90.07
CUDA Version: 12.4
GPU: 1 H100 SXM

>>> import torch
>>> torch.cuda.is_available()
/home/ubuntu/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False

I’ve seen others suggest installing Fabric Manager, but that should not be necessary with a single GPU. I’ve tried restarting, and that did not solve it either.

Any help would be much appreciated!

Have you fixed this issue?

If so, how did you fix it?

Hi, have you solved this yet? What’s the root cause?

Any updates? I’m having the same issue.

Install the NVIDIA driver and the matching nvidia-fabricmanager package on the physical host (the machine that has the GPUs). That worked for me.

I have resolved this issue. My SXM H20 machine has 8 GPUs, 7 of which are passed through to the QEMU/KVM virtual machine.

Install the GPU driver and the matching version of Fabric Manager on the host machine, and make sure Fabric Manager is running properly. After you then install the GPU driver inside the virtual machine, you can use CUDA without any issues.
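
To check the "Fabric Manager runs properly" part on the host, something like the following may help. It is a minimal sketch that assumes the service is registered with systemd under the unit name nvidia-fabricmanager (the package name mentioned above). The function name fabric_manager_state is made up for illustration, so adjust both if your distribution differs.

```python
import shutil
import subprocess


def fabric_manager_state(unit="nvidia-fabricmanager"):
    """Return systemd's state string for the unit, or 'unknown' if it cannot be queried."""
    # Assumption: the host uses systemd and the unit has this name.
    if shutil.which("systemctl") is None:
        return "unknown"
    try:
        result = subprocess.run(
            ["systemctl", "is-active", unit],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,  # is-active exits non-zero for inactive units
        )
    except (OSError, subprocess.SubprocessError):
        return "unknown"
    return result.stdout.strip() or "unknown"


if __name__ == "__main__":
    print(fabric_manager_state())
```

A result of "active" means systemd considers the service running. Anything else ("inactive", "failed", "unknown") suggests checking the service logs and whether the Fabric Manager version matches the installed driver version.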

Note that with PyTorch there is still a chance of hitting Error 802: system not yet initialized. You only need to add the line torch.cuda.empty_cache() right after importing torch in your Python code, and PyTorch will then work normally.
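
Since the error text says the system is "not yet" initialized, another option that might be worth trying is to retry the first CUDA call for a short while instead of failing immediately. This is a different technique from the empty_cache() line above and is not confirmed by anyone in this thread. Whether a retry inside the same process can succeed is an assumption. The helper wait_until_ready is made up for illustration.

```python
import time


def wait_until_ready(probe, attempts=5, delay_s=2.0, sleep=time.sleep):
    """Call probe() until it stops raising RuntimeError, then return its result."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return probe()
        except RuntimeError as exc:
            last_error = exc
            # Pause between attempts, but not after the final one.
            if attempt < attempts - 1:
                sleep(delay_s)
    # Every attempt failed: surface the most recent error to the caller.
    raise last_error
```

With PyTorch installed, a possible use would be wait_until_ready(lambda: torch.cuda.get_device_name(0)). If Fabric Manager is not running on the host at all, retrying will not help and the last RuntimeError is raised as before.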

Alternatively, you can pass NVLink through to the virtual machine and install Fabric Manager inside it, which also works. (a Chinese engineer)