CUDA initialization error on RTX 6000 Blackwell Server Edition

Hello, I’ve been having trouble stress testing my RTX 6000s for the past week. I have tried CUDA 12.8, 12.9, and 13.0 with no luck; I keep getting the error "No Cuda devices found". I have tried several stress tests (PyTorch, gpu-burn, dcgmi diag, and the Phoronix Test Suite), and all of them fail the same way. I tried these on Ubuntu 24.04 as well as on a lightweight Linux distribution. I have included a screenshot below of what my error message looks like. I have the NVIDIA open drivers installed, and nvidia-smi and nvcc both work. Any help would be greatly appreciated.

Hello,

We had exactly the same problem on a different configuration: CUDA 12.9/13/13.1, RTX 6000 BSE, driver 680.105.

OS: Rocky 9.7

nvidia-smi worked, but CUDA did not.

What finally worked:

We disabled the HMM (Heterogeneous Memory Management) option in the nvidia_uvm kernel module, then reloaded the module (run as root):

cat >/etc/modprobe.d/nvidia-uvm.conf <<'EOF'
options nvidia_uvm uvm_disable_hmm=1
EOF
modprobe -r nvidia_uvm
modprobe nvidia_uvm
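To confirm that the option actually took effect after reloading the module, a check along these lines may help. This is only a sketch: it assumes the parameter is readable under /sys/module/nvidia_uvm/parameters/ (the usual place for kernel module parameters, but not something confirmed in this thread), and hmm_status is a made-up helper name.

```shell
#!/bin/sh
# Report whether HMM is disabled in the currently loaded nvidia_uvm module.
# ASSUMPTION: the parameter is exposed in sysfs at the default path below;
# pass another path as $1 if your system differs.
hmm_status() {
    param="${1:-/sys/module/nvidia_uvm/parameters/uvm_disable_hmm}"
    if [ ! -r "$param" ]; then
        # Module not loaded, or parameter not exposed
        echo "unknown"
        return 1
    fi
    # Boolean module parameters normally read back as Y or N
    case "$(cat "$param")" in
        Y|1) echo "disabled" ;;
        *)   echo "enabled" ;;
    esac
}

echo "HMM is: $(hmm_status)"
```

If this prints "disabled" and CUDA still finds no devices, the cause is probably something else. If it prints "enabled", the modprobe -r step may have failed because something was still holding nvidia_uvm open; a reboot would also pick up the new option.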

There you go. I hope this helps you or anyone else who comes across this.

It took us a long time to figure it out…