We have a hardware setup with multiple H100s (as listed by lspci -d 10de:). I have set up one GPU for passthrough to my QEMU/KVM VM. After installing the drivers on the guest, I can see that it is attached:
From nvidia-smi in the guest, I can see the single GPU I attached:
name, pci.bus_id, vbios_version, driver_version
NVIDIA H100 80GB HBM3, 00000000:01:00.0, 96.00.61.00.01, 550.54.15
CUDA Version is 12.4, Driver Version: 550.54.15. MIG is disabled, Persistence-M is Off.
Guest is on Ubuntu 22.04.
But I still get errors when running CUDA samples directly or via PyTorch:
>>> import os,torch
>>> torch.cuda.is_available()
.../python3.10/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
Setting the following environment variables makes is_available() return True, but the next call still fails:
>>> os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
>>> os.environ["CUDA_VISIBLE_DEVICES"]="0"
>>> os.environ["PYTORCH_NVML_BASED_CUDA_CHECK"]="1"
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
...
File ".../site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
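One way to check whether the failure comes from the driver rather than from PyTorch is to call cuInit from the CUDA driver API directly. This is a minimal sketch, assuming libcuda.so.1 is on the guest's loader path; the helper name cu_init_status is hypothetical and not part of any library. A return value of 0 means CUDA_SUCCESS, while 802 is CUDA_ERROR_SYSTEM_NOT_READY, the same code PyTorch reports above.

```python
import ctypes


def cu_init_status(lib_name="libcuda.so.1"):
    """Call cuInit(0) from the CUDA driver library and return its CUresult code.

    Returns None if the driver library cannot be loaded at all.
    cu_init_status is a hypothetical helper name used only for illustration.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        # Driver library not found or not loadable on this machine.
        return None
    # CUresult cuInit(unsigned int Flags); Flags must be 0.
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)


status = cu_init_status()
if status is None:
    print("libcuda.so.1 could not be loaded")
elif status == 0:
    print("cuInit succeeded (CUDA_SUCCESS)")
else:
    print(f"cuInit failed with CUresult {status}")
```

If this also returns 802, the problem sits below PyTorch, in the driver or platform setup of the guest, rather than in the Python environment.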
Thank You