GPU Dropout / Driver Error - NVIDIA bug-report Attached

Issue Description
On a Linux system with eight NVIDIA GeForce RTX 3090 GPUs, I am experiencing GPU dropouts and driver errors. After a reboot, one or more GPUs sometimes disappear and are not recognized by the system.

System Environment

  • Operating System: Ubuntu 22.04.5 LTS, kernel 6.8.0-88-generic
  • NVIDIA Driver Version: 535.274.02
  • CUDA Version: 11.8 (nvcc release 11.8.89)
  • GPU Model(s) and Count: 8 × NVIDIA GeForce RTX 3090, 24 GB each

Observed Behavior

  1. One or more GPUs occasionally drop out or become invisible to the system after reboot.
  2. Running CUDA or deep learning tasks sometimes triggers errors.
  3. Current nvidia-smi output (all GPUs present):
    Unable to determine the device handle for GPU0000:CA:00.0: Unknown Error

Troubleshooting Steps Already Attempted

  • Rebooted the system
  • Checked GPU status via nvidia-smi
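
A possible next step beyond the two checks above is to scan the kernel log for NVRM Xid messages after each reboot, since these usually identify the failing GPU by PCI address. Below is a minimal sketch of such a scan. It assumes the commonly seen line shape `NVRM: Xid (PCI:<address>): <code>, ...`; the names `find_xid_events` and `count_by_gpu` are illustrative helpers and not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed line shape: "NVRM: Xid (PCI:0000:ca:00): 79, pid=..., ..."
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return (pci_address, xid_code) pairs found in kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            # Lower-case the address so the same GPU always gets the same key.
            events.append((match.group(1).lower(), int(match.group(2))))
    return events


def count_by_gpu(events):
    """Count Xid events per PCI address to show which GPU is hit most often."""
    return Counter(address for address, _ in events)
```

The text passed to `find_xid_events` could come from the saved output of `sudo dmesg` or `journalctl -k`. If the same PCI address keeps recurring, that would point toward one card, riser, or slot and away from a driver-wide problem.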

Additional Information

  • NVIDIA Bug Report: nvidia-bug-report.log.gz (attached, 359.2 KB)

  • Frequency and time of occurrence: intermittent, especially after reboot

  • Any other relevant details: sometimes one GPU does not appear in nvidia-smi until a full power cycle

Could NVIDIA engineers or community members provide guidance or suggest further troubleshooting steps? Thank you.