Error running CUDA on a VM with GPU passthrough: torch.cuda.get_device_name() fails with error 802 (system not yet initialized)

We have a hardware setup with multiple H100s (as listed by lspci -d 10de:). I have set up one GPU for passthrough to my QEMU/KVM VM. After installing the drivers on the guest, nvidia-smi in the guest shows the single GPU I attached:

name, pci.bus_id, vbios_version, driver_version
NVIDIA H100 80GB HBM3, 00000000:01:00.0, 96.00.61.00.01, 550.54.15

CUDA Version is 12.4, Driver Version: 550.54.15
MIG is disabled, Persistence-M is Off.
Guest is on Ubuntu 22.04.
However, I still get errors when running CUDA samples directly or via PyTorch.

>>> import os,torch
>>> torch.cuda.is_available()
.../python3.10/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
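
To check whether the error comes from the driver itself rather than from PyTorch, the driver API can be called directly. The following is only a minimal sketch: it assumes the driver library is loadable as libcuda.so.1 (the usual name on Linux), and the helper name probe_cuda_driver is made up for illustration. It calls cuInit(0) and asks cuGetErrorName for the symbolic name of the result.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit(0) directly; return (code, name), or None if the library cannot be loaded."""
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None  # driver library not installed or not on the loader path

    # CUresult cuInit(unsigned int flags)
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    # CUresult cuGetErrorName(CUresult error, const char **pStr)
    libcuda.cuGetErrorName.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p)]
    libcuda.cuGetErrorName.restype = ctypes.c_int

    code = libcuda.cuInit(0)
    name = ctypes.c_char_p()
    libcuda.cuGetErrorName(code, ctypes.byref(name))
    # cuGetErrorName leaves the pointer NULL for codes it does not recognize
    return code, (name.value.decode() if name.value else "unknown")
```

Running print(probe_cuda_driver()) in the guest gives the raw result. In the driver API, code 802 is CUDA_ERROR_SYSTEM_NOT_READY, so if the same code shows up here, the problem would appear to sit below PyTorch.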

Setting the following environment variables makes is_available() return True, but the next call still fails:

>>> os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
>>> os.environ["CUDA_VISIBLE_DEVICES"]="0"
>>> os.environ["PYTORCH_NVML_BASED_CUDA_CHECK"]="1"
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
...
  File ".../site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
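
As far as I understand it, PYTORCH_NVML_BASED_CUDA_CHECK=1 makes is_available() try an NVML-based check before the CUDA runtime check, so a True result need not mean that CUDA can actually initialize. The sketch below compares the two; compare_cuda_checks is a hypothetical helper name, and it returns None when torch is not installed.

```python
import os


def compare_cuda_checks():
    """Return is_available() and real-initialization status, or None if torch is missing."""
    # Ask PyTorch to try the NVML-based availability check first.
    os.environ["PYTORCH_NVML_BASED_CUDA_CHECK"] = "1"
    try:
        import torch
    except ImportError:
        return None

    is_available = torch.cuda.is_available()
    try:
        torch.cuda.init()  # forces real CUDA initialization
        runtime_ok, error = True, None
    except Exception as exc:
        runtime_ok, error = False, str(exc)
    return {"is_available": is_available, "runtime_ok": runtime_ok, "error": error}
```

If this reports is_available as True while runtime_ok is False, that would match the transcript above: the device is visible, but initialization still fails with error 802.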

Thank You


I’m having the same issue with a very similar setup:
VM OS: Ubuntu 22.04
NVIDIA-SMI 550.90.07
Driver Version: 550.90.07
CUDA Version: 12.4
GPU: 1 H100 SXM

>>> import torch
>>> torch.cuda.is_available()
/home/ubuntu/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False

I’ve seen others suggest installing Fabric Manager, but that should not be necessary with a single GPU. I’ve also tried rebooting, and that did not solve it either.

Any help would be much appreciated!

Have you fixed this issue?

If so, how did you fix it?

Hi, have you solved this yet? What’s the root cause?

Any updates? I’m having the same issue.