Issue Description
On a Linux system with NVIDIA GPU(s) (model: NVIDIA GeForce RTX 3090), I am experiencing GPU dropouts and driver errors. After reboot, one or more GPUs sometimes disappear and are not recognized by the system.
System Environment
- Operating System: Ubuntu 22.04.5 LTS, kernel 6.8.0-88-generic
- NVIDIA Driver Version: 535.274.02
- CUDA Version: 11.8 (nvcc release 11.8.89)
- GPU Model(s) and Count: 8 × NVIDIA GeForce RTX 3090, 24 GB each
Observed Behavior
- One or more GPUs occasionally drop out or become invisible to the system after reboot.
- Running CUDA or deep learning tasks sometimes triggers errors.
- Current nvidia-smi output (all GPUs present):
Unable to determine the device handle for GPU0000:CA:00.0: Unknown Error
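The nvidia-smi message by itself does not say why the device handle was lost, so one further check is to look for driver Xid messages in the kernel log for the same boot. The following is a minimal sketch of that check, not something I have run on this system yet. It assumes the driver logs lines containing "NVRM: Xid (PCI:<bus id>): <code>" and that the kernel log is readable through journalctl. The names find_xid_events and read_kernel_log are hypothetical helpers made up for this example.

```python
import re
import subprocess
from collections import Counter

# Assumed line shape: "... NVRM: Xid (PCI:0000:ca:00): 79, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return (pci_bus_id, xid_code) tuples for every Xid line in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group(1).lower(), int(match.group(2))))
    return events


def read_kernel_log():
    """Read the current boot's kernel log; returns '' if journalctl is missing."""
    try:
        result = subprocess.run(
            ["journalctl", "-k", "-b", "--no-pager"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return ""
    return result.stdout


if __name__ == "__main__":
    counts = Counter(find_xid_events(read_kernel_log()))
    if not counts:
        print("No NVRM Xid lines found (or the kernel log was not readable).")
    for (bus_id, code), seen in sorted(counts.items()):
        print(f"{bus_id}: Xid {code} seen {seen} time(s)")
```

Reading the kernel log may require root or membership in the adm or systemd-journal group. Any Xid codes reported for 0000:ca:00 would be worth including alongside the bug report.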
Troubleshooting Steps Already Attempted
- Rebooted the system
- Checked GPU status via nvidia-smi
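To make the nvidia-smi check repeatable after every reboot, it could be scripted roughly as follows. This is a minimal sketch under a few assumptions: the expected count of 8 is taken from the GPU count listed above, the script relies on the nvidia-smi --query-gpu and --format=csv,noheader options, and parse_gpu_list and query_gpus are hypothetical helper names chosen for this example.

```python
import subprocess

EXPECTED_GPUS = 8  # assumption: the GPU count listed in the system environment


def parse_gpu_list(csv_text):
    """Parse 'index, pci.bus_id, name' rows; skip anything else (e.g. error text)."""
    gpus = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 3 and parts[0].isdigit():
            gpus.append((int(parts[0]), parts[1], parts[2]))
    return gpus


def query_gpus():
    """Run nvidia-smi and return (parsed GPU rows, stderr text)."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,pci.bus_id,name",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return [], "nvidia-smi not found"
    return parse_gpu_list(result.stdout), result.stderr.strip()


if __name__ == "__main__":
    gpus, err = query_gpus()
    for index, bus_id, name in gpus:
        print(f"GPU {index}: {name} at {bus_id}")
    if len(gpus) != EXPECTED_GPUS:
        print(f"WARNING: {len(gpus)} of {EXPECTED_GPUS} GPUs visible")
    if err:
        print(f"nvidia-smi stderr: {err}")
```

Running this from a boot-time service or cron job and saving the output would show which PCI bus IDs go missing and how often.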
Additional Information
- NVIDIA Bug Report: nvidia-bug-report.log.gz (attached, 359.2 KB)
- Frequency and time of occurrence: intermittent, especially after reboot
- Any other relevant details: sometimes one GPU does not appear in nvidia-smi until a full power cycle
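When a GPU is missing from nvidia-smi, it would help to know whether the card is still enumerated on the PCI bus at all. That would distinguish a driver-level failure from a device that never came up. Below is a minimal sketch of that comparison, assuming the standard Linux sysfs layout under /sys/bus/pci/devices, the NVIDIA PCI vendor ID 0x10de, and display-class devices having a class code beginning with 0x03. The function names are hypothetical.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # NVIDIA's PCI vendor ID


def is_nvidia_display(vendor, pci_class):
    """True for NVIDIA devices whose PCI class is a display controller (0x03xxxx)."""
    return (vendor.strip().lower() == NVIDIA_VENDOR_ID
            and pci_class.strip().lower().startswith("0x03"))


def nvidia_devices_on_bus(sysfs_root="/sys/bus/pci/devices"):
    """List PCI addresses of NVIDIA display devices found under sysfs_root."""
    root = Path(sysfs_root)
    found = []
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text()
            pci_class = (dev / "class").read_text()
        except OSError:
            continue  # entry without readable vendor/class files
        if is_nvidia_display(vendor, pci_class):
            found.append(dev.name)
    return found


if __name__ == "__main__":
    devices = nvidia_devices_on_bus()
    print(f"{len(devices)} NVIDIA display device(s) on the PCI bus:")
    for address in devices:
        print(f"  {address}")
```

If this lists 8 devices while nvidia-smi shows fewer, the missing card is still on the bus and the problem is more likely in driver initialization. If the card is absent here as well, it did not enumerate at all, which points more toward power, riser, or slot issues.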
Could NVIDIA engineers or community members provide guidance or suggest further troubleshooting steps? Thank you.