CUDA Initialization Issue: cudaGetDeviceCount returned 802 on Dell PowerEdge XE9680 with NVIDIA Driver 560.x and CUDA 12.6

Hello NVIDIA Community,

I’m encountering a CUDA initialization issue on my Dell PowerEdge XE9680 server and would greatly appreciate any assistance in resolving it.

System Configuration:

  • Server Model: Dell PowerEdge XE9680
  • Operating System: Ubuntu 24.04
  • NVIDIA Driver Version: 560.x
  • CUDA Version: 12.6
  • cuDNN Version: (Specify the version if applicable)
  • GPUs: (Specify the number and type of GPUs installed)

Problem Description:

After installing the NVIDIA driver and CUDA 12.6, I attempted to run the deviceQuery sample from the CUDA toolkit to verify that everything is set up correctly. Unfortunately, I’m getting the following error:

cudaGetDeviceCount returned 802
-> system not yet initialized
Result = FAIL

This error suggests that CUDA is unable to initialize the GPUs, but I haven’t been able to pinpoint the cause.
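For reference, code 802 is cudaErrorSystemNotReady in the CUDA Runtime API error enum. Its documentation says to verify that the system configuration is valid and that all required driver daemons are running. Below is a minimal Python sketch for decoding the numeric code from deviceQuery output. The helper name decode_device_query and the small lookup table are illustrative assumptions, not part of the CUDA toolkit, and the table is only an excerpt of the enum.

```python
import re

# Excerpt of the CUDA Runtime API error enum (cudaError_t).
# Only a few initialization-related codes are listed; this is not the full enum.
CUDA_ERRORS = {
    100: "cudaErrorNoDevice",
    802: "cudaErrorSystemNotReady",
    803: "cudaErrorSystemDriverMismatch",
    804: "cudaErrorCompatNotSupportedOnDevice",
}


def decode_device_query(output):
    """Return (code, enum_name) parsed from deviceQuery output, or None.

    Hypothetical helper: it only looks for the 'cudaGetDeviceCount returned N'
    line that deviceQuery prints on failure.
    """
    match = re.search(r"cudaGetDeviceCount returned (\d+)", output)
    if match is None:
        return None
    code = int(match.group(1))
    return code, CUDA_ERRORS.get(code, "unknown")


if __name__ == "__main__":
    sample = "cudaGetDeviceCount returned 802\n-> system not yet initialized\nResult = FAIL\n"
    print(decode_device_query(sample))
```

A result of cudaErrorSystemNotReady points away from the toolkit install itself and toward a missing or stopped system service.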

Troubleshooting Steps I’ve Tried:

  1. Verified the NVIDIA Driver Installation:
     • Ran nvidia-smi to confirm that the driver is installed and recognizes the GPUs correctly. Everything appears normal in the output.
  2. Reinstalled CUDA and the NVIDIA Drivers:
     • Uninstalled and reinstalled both CUDA 12.6 and the NVIDIA driver to rule out any installation issues.
  3. Checked for Compatibility:
     • Confirmed that CUDA 12.6 is compatible with NVIDIA driver version 560.x.
  4. Reset the NVIDIA Driver:
     • Stopped and restarted the nvidia-persistenced service and reloaded the NVIDIA kernel modules.
  5. Rebuilt the Initramfs:
     • Rebuilt the initramfs and rebooted the system to ensure all changes took effect.
  6. Checked Kernel Modules:
     • Verified that the nvidia kernel modules are correctly loaded using lsmod | grep nvidia.
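The kernel module check in the last step can also be scripted. The sketch below reads /proc/modules, the file lsmod gets its data from, and lists modules whose name starts with "nvidia". It is only roughly equivalent to lsmod | grep nvidia, since grep also matches the dependency column. The function name loaded_nvidia_modules is an illustrative assumption.

```python
from pathlib import Path


def loaded_nvidia_modules(proc_modules_text):
    """Return the names of loaded kernel modules that start with 'nvidia'.

    Each line of /proc/modules begins with the module name, followed by
    size, reference count, dependents, state, and load address.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("nvidia"):
            names.append(fields[0])
    return names


if __name__ == "__main__":
    path = Path("/proc/modules")
    if path.exists():
        print(loaded_nvidia_modules(path.read_text()))
    else:
        print("/proc/modules not found (not a Linux system?)")
```

Note that the modules being loaded, like a clean nvidia-smi output, only shows the driver is present. It does not show that every service CUDA needs on this platform is running.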

Request for Assistance:

Despite these efforts, the issue persists. I would greatly appreciate any insights or suggestions you can provide on how to resolve this issue.

  • Are there specific logs or diagnostic steps that I should check to identify why CUDA is failing to initialize?
  • Could this be related to the hardware configuration on the Dell PowerEdge XE9680, or is it more likely to be a software issue?
  • Has anyone else experienced similar issues with this or similar setups?

Thank you in advance for your help!

The issue is an improper installation of the NVIDIA Fabric Manager, which is mandatory for NVSwitch systems.
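A quick way to check this is to see whether the nvidia-fabricmanager systemd service is active and which driver version the GPUs report. The Python sketch below is one possible approach, not an official tool. The function names run and fabric_manager_report are hypothetical; the commands it calls (systemctl is-active and nvidia-smi --query-gpu=driver_version) are standard.

```python
import subprocess


def run(cmd):
    """Run a command and return (exit_code, stdout); (127, "") if it is not installed."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        return 127, ""
    return proc.returncode, proc.stdout.strip()


def fabric_manager_report(runner=run):
    """Summarize Fabric Manager service state and the reported driver version(s).

    'runner' is injectable so the logic can be exercised without real hardware.
    """
    # systemctl is-active exits 0 and prints "active" only when the unit is running.
    code, state = runner(["systemctl", "is-active", "nvidia-fabricmanager"])
    # nvidia-smi prints one driver version per GPU; collapse duplicates.
    _, driver = runner(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
    )
    return {
        "service_active": code == 0 and state == "active",
        "service_state": state or "unknown",
        "driver_versions": sorted(set(driver.split())),
    }


if __name__ == "__main__":
    print(fabric_manager_report())
```

If service_active is False, the usual next steps are to install the Fabric Manager package whose version matches the reported driver version, then enable and start the service with sudo systemctl enable --now nvidia-fabricmanager. The exact package name depends on the distribution and driver branch, so it is not given here.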

That worked! Thank you @Robert_Crovella
