We have a hardware setup with multiple H100s (confirmed from lspci -d 10de:). I have set up one GPU for passthrough to my QEMU/KVM VM. After installing the drivers on the guest, I can see that it is attached:
From nvidia-smi in the guest, I can see the single GPU I attached.
CUDA Version is 12.4, Driver Version: 550.54.15
MIG is disabled, Persistence-M is Off.
Guest is on Ubuntu 22.04.
But I still get errors when trying to run some CUDA samples, either directly or via PyTorch.
>>> import os,torch
>>> torch.cuda.is_available()
.../python3.10/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
Setting the following environment variables makes is_available() return True, but the next call fails:
>>> os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
>>> os.environ["CUDA_VISIBLE_DEVICES"]="0"
>>> os.environ["PYTORCH_NVML_BASED_CUDA_CHECK"]="1"
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
...
File ".../site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
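To make the workaround above reproducible outside an interactive session, the environment variables are best set before torch is imported, since CUDA reads CUDA_VISIBLE_DEVICES when its context is first initialized. Note that PYTORCH_NVML_BASED_CUDA_CHECK only changes how PyTorch probes for devices: is_available() queries NVML instead of initializing CUDA, so it can report True while full initialization (get_device_name, tensor allocation) still fails with Error 802. A minimal sketch, guarded so it also runs where torch is not installed:

```python
import os

# Set CUDA-related variables BEFORE importing torch; CUDA reads
# CUDA_VISIBLE_DEVICES once, when its context is first initialized.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# This only changes the availability probe: is_available() asks NVML
# instead of initializing CUDA, which is why it can return True even
# though full CUDA initialization still fails with Error 802.
os.environ["PYTORCH_NVML_BASED_CUDA_CHECK"] = "1"

try:
    import torch
    available = torch.cuda.is_available()
except ImportError:
    available = None  # torch not installed in this environment
print(available)
```

This matches the transcript above: the NVML-based check passes, but it does not address the underlying initialization failure.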
I’m having the same issue, very similar setup:
VM OS: Ubuntu 22.04
NVIDIA-SMI 550.90.07
Driver Version: 550.90.07
CUDA Version: 12.4
GPU: 1 H100 SXM
>>> import torch
>>> torch.cuda.is_available()
/home/ubuntu/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
I’ve seen others say to install Fabric Manager, but that should not be necessary with a single GPU. I’ve tried restarting, and that did not solve it either.
I have resolved this issue. My SXM H20 machine has 8 GPUs, 7 of which are passed through to the QEMU/KVM virtual machine.
Install the GPU driver and the matching version of Fabric Manager on the host machine, and make sure the Fabric Manager service is running. After installing the driver inside the virtual machine, CUDA works without any issues.
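On the host side, the "make sure Fabric Manager is running" step can be checked programmatically. A small sketch, assuming the standard nvidia-fabricmanager systemd unit shipped with the Fabric Manager package:

```python
import subprocess

def fabric_manager_active() -> bool:
    """Return True if the nvidia-fabricmanager systemd unit is active.

    Assumes the standard unit name from NVIDIA's Fabric Manager package;
    degrades gracefully on hosts without systemctl.
    """
    try:
        result = subprocess.run(
            ["systemctl", "is-active", "--quiet", "nvidia-fabricmanager"],
            check=False,
        )
        return result.returncode == 0
    except FileNotFoundError:
        # systemctl not present (non-systemd host)
        return False

print(fabric_manager_active())
```

Run this on the host before attaching GPUs to the guest; if it prints False, CUDA in the VM will typically fail with Error 802 on SXM systems.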
Note that when using PyTorch there is still a chance of hitting Error 802: system not yet initialized. Adding a torch.cuda.empty_cache() call right after importing torch makes PyTorch work normally.
Alternatively, you can pass NVLink through to the virtual machine and install Fabric Manager inside it, which also ensures smooth operation. — a Chinese engineer