Hello NVIDIA Community,
I’m encountering a CUDA initialization issue on my Dell PowerEdge XE9680 server and would greatly appreciate any assistance in resolving it.
System Configuration:
- Server Model: Dell PowerEdge XE9680
- Operating System: Ubuntu 24.04
- NVIDIA Driver Version: 560.x
- CUDA Version: 12.6
- cuDNN Version: (Specify the version if applicable)
- GPUs: (Specify the number and type of GPUs installed)
Problem Description:
After installing the NVIDIA driver and CUDA 12.6, I attempted to run the `deviceQuery` sample from the CUDA toolkit to verify that everything is set up correctly. Unfortunately, I’m getting the following error:
cudaGetDeviceCount returned 802
-> system not yet initialized
Result = FAIL
Error 802 corresponds to cudaErrorSystemNotReady, which suggests that the CUDA runtime cannot initialize the GPUs even though the driver is loaded, but I haven’t been able to pinpoint the cause.
Troubleshooting Steps I’ve Tried:
- Verified the NVIDIA Driver Installation:
- Ran `nvidia-smi` to confirm that the driver is installed and recognizes the GPUs. Everything appears normal in the output.
- Reinstalled CUDA and the NVIDIA Drivers:
- I’ve uninstalled and reinstalled both CUDA 12.6 and the NVIDIA driver to rule out any installation issues.
- Checked for Compatibility:
- Confirmed that CUDA 12.6 is compatible with the NVIDIA driver version 560.x.
- Reset the NVIDIA Driver:
- Stopped and restarted the `nvidia-persistenced` service and reloaded the NVIDIA kernel modules.
- Rebuilt the Initramfs:
- Rebuilt the initramfs and rebooted the system to ensure all changes take effect.
- Checked Kernel Modules:
- Verified that the `nvidia` kernel modules are correctly loaded using `lsmod | grep nvidia`.
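For reference, the read-only checks above correspond roughly to the commands below. This is a minimal sketch, not a transcript of what I ran: `run_check` is a hypothetical helper name, and the script assumes a systemd-based install where `systemctl` manages `nvidia-persistenced`. It does not reload modules or rebuild the initramfs.

```shell
#!/bin/sh
# Read-only re-check of the verification steps listed above.
# Nothing here changes system state.

# run_check LABEL COMMAND [ARGS...]
# Hypothetical helper: prints a header, runs the command if it exists,
# and reports a non-zero exit status without aborting the script.
run_check() {
    label=$1
    shift
    printf '== %s ==\n' "$label"
    if ! command -v "$1" >/dev/null 2>&1; then
        printf '%s: not found, skipping\n' "$1"
        return 0
    fi
    "$@" || printf '%s exited with status %s\n' "$1" "$?"
}

# Driver installed and GPUs visible?
run_check "driver and GPUs" nvidia-smi

# Persistence daemon state (assumes systemd manages the service).
run_check "persistence daemon" systemctl is-active nvidia-persistenced

# Kernel modules currently loaded.
run_check "loaded nvidia modules" sh -c 'lsmod | grep nvidia'
```

Each section prints its own header, so the full output can be pasted into a reply as-is.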
Request for Assistance:
Despite these efforts, the issue persists. Any insights or suggestions on how to resolve it would be welcome.
- Are there specific logs or diagnostic steps that I should check to identify why CUDA is failing to initialize?
- Could this be related to the hardware configuration on the Dell PowerEdge XE9680, or is it more likely to be a software issue?
- Has anyone else experienced similar issues with this or similar setups?
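On the logs question, these are the places I know of so far, in case someone can point to what else to look at. This is a hedged sketch: `filter_nvidia_lines` is a hypothetical helper name, the script assumes `dmesg` and `journalctl` are available, and reading the kernel log on Ubuntu may need `sudo`.

```shell
#!/bin/sh
# filter_nvidia_lines: hypothetical helper that keeps only log lines
# mentioning the NVIDIA driver ("NVRM" is the prefix the kernel module logs under).
# "|| true" keeps the exit status at 0 when nothing matches.
filter_nvidia_lines() {
    grep -i -E 'nvrm|nvidia' || true
}

# Kernel ring buffer for the current boot; may need sudo on Ubuntu.
if command -v dmesg >/dev/null 2>&1; then
    dmesg 2>/dev/null | filter_nvidia_lines
fi

# The same kernel messages via the journal, if systemd-journald is in use.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b --no-pager 2>/dev/null | filter_nvidia_lines
fi

# nvidia-bug-report.sh ships with the driver and writes
# nvidia-bug-report.log.gz into the current directory. It is left
# commented out here because it needs root.
# sudo nvidia-bug-report.sh
```

I can attach the filtered output or the bug report archive if that would help.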
Thank you in advance for your help!