cudaGetDeviceCount returned 3 -> initialization error, CUDA 13.0, RHEL 9, HGX B200

Hello,

I’m currently facing an issue with setting up a B200 cluster and would like to ask for some guidance.

I’m using a system with HGX 8xB200, with the following software versions:

  • OS: RHEL 9 (Red Hat Enterprise Linux 9)
  • NVIDIA Driver: 580.105.08
  • NVIDIA Fabric Manager: 580.105.08
  • CUDA Toolkit: cuda_13.0

Here is some output to verify the installation of the above:


However, I encountered an initialization error when running ./deviceQuery:

./deviceQuery
./deviceQuery Starting…
CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 3
→ initialization error
Result = FAIL


Additionally, I have to load the ib_umad module manually before starting nvidia-fabricmanager; otherwise the service fails to start. Is this normal? Is there a way to make ib_umad load automatically at boot?

Any insight or recommendations would be greatly appreciated.
Thank you!

Hi,

We managed to resolve the issue with the following steps:

  1. Lowering the CUDA version to 12.8
  2. Disabling KASLR: grubby --update-kernel=ALL --args=nokaslr
  3. Installing doca_ofed so that ib_umad loads automatically
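
For anyone wanting to confirm that steps 2 and 3 took effect after a reboot, a small check script along these lines may help. It is only a sketch, not something taken from this thread: the function names (has_kernel_arg, module_loaded, check) are made up for illustration, and it assumes the service is named nvidia-fabricmanager as mentioned above. It only reads /proc/cmdline and /proc/modules and queries systemctl, so it changes nothing on the system.

```shell
#!/usr/bin/env bash
# Sketch: verify the post-fix state. Helper names are hypothetical.

# Succeeds if the kernel command line string ($1) contains the exact argument ($2).
has_kernel_arg() {
    local cmdline="$1" arg="$2"
    case " $cmdline " in
        *" $arg "*) return 0 ;;
    esac
    return 1
}

# Succeeds if module $1 is the first field of some line in a
# /proc/modules-style file ($2, defaults to /proc/modules).
module_loaded() {
    local name="$1" file="${2:-/proc/modules}"
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
}

# Runs a command quietly and prints "ok" or "FAIL" with a label.
check() {
    local label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "ok:   $label"
    else
        echo "FAIL: $label"
    fi
}

check "nokaslr on kernel command line" \
    has_kernel_arg "$(cat /proc/cmdline 2>/dev/null)" nokaslr
check "ib_umad module loaded" module_loaded ib_umad
check "nvidia-fabricmanager active" \
    systemctl is-active --quiet nvidia-fabricmanager
```

If all three lines report ok, rerunning ./deviceQuery is the real test. Note that the nokaslr argument set by grubby only appears in /proc/cmdline after the next reboot.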
