GPU passthrough on HPE Cray XD670

Hi,

I’m trying to set up GPU passthrough on this beast. I’ve previously used libvirt on an older Apollo 6500 (8x A100). This time I’d like to get it working on this one with Proxmox VE 8.x.

I’ve tested lots of guides. The issue is that the driver loads and detects the GPU, but DCGM fails on all but the level 1 tests, with a failure on cudaDeviceGetByPCI: “system not yet initialized”.

I’m not really sure where to start troubleshooting: the host or the VM.

- Tested passing through 1 GPU only
- Tested passing through everything, including the NVSwitches
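
To help narrow down host vs. VM, here is a minimal Python sketch (my own idea of a useful check, not taken from any of the guides) that lists every NVIDIA PCI function (vendor ID 0x10de) visible in sysfs, along with its bound driver and IOMMU group. The function name list_nvidia_devices and the sysfs_root parameter are hypothetical, introduced only for illustration. The assumption is that on the host, passed-through devices should show vfio-pci as the driver, while inside the guest they should show nvidia.

```python
import os
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID used by NVIDIA GPUs and NVSwitches


def list_nvidia_devices(sysfs_root="/sys"):
    """Return one dict per NVIDIA PCI function found under sysfs_root.

    sysfs_root is a hypothetical parameter so the function can be pointed
    at a copy of the tree; on a real system the default is what you want.
    """
    devices = []
    pci_dir = Path(sysfs_root) / "bus" / "pci" / "devices"
    if not pci_dir.is_dir():
        return devices

    for dev in sorted(pci_dir.iterdir()):
        vendor_file = dev / "vendor"
        device_file = dev / "device"
        if not vendor_file.is_file() or not device_file.is_file():
            continue
        if vendor_file.read_text().strip().lower() != NVIDIA_VENDOR_ID:
            continue

        # "driver" and "iommu_group" are symlinks; the link target's last
        # path component is the driver name / group number.
        driver_link = dev / "driver"
        group_link = dev / "iommu_group"
        devices.append({
            "address": dev.name,
            "device_id": device_file.read_text().strip(),
            "driver": os.path.basename(os.readlink(driver_link))
            if driver_link.is_symlink() else None,
            "iommu_group": os.path.basename(os.readlink(group_link))
            if group_link.is_symlink() else None,
        })
    return devices


if __name__ == "__main__":
    for d in list_nvidia_devices():
        print(f"{d['address']}  id={d['device_id']}  "
              f"driver={d['driver']}  iommu_group={d['iommu_group']}")
```

Running this on both the Proxmox host and the RHEL 9 host and comparing the output should show whether the two hosts bind and group the devices the same way.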

I’m using driver 555.42.02 with CUDA 12.6.

Any suggestions?

Disabling NVLink in the guest OS gave me some hope, but I can only get the VM to work with 1 GPU. When I add more, it crashes after running nvidia-smi.

I’ve set up 2 identical Cray hosts in parallel, one with Proxmox and the other with RHEL 9 and libvirt. The RHEL 9 system seems to work with multiple GPUs so far, but no luck with Proxmox.