GPU Crashing During Stress Test – "GPU has fallen off the bus" Error on RTX 4090 with Driver 560.35.03

Hi NVIDIA community,

I’ve been facing consistent issues with my system involving an RTX 4090 (Driver 560.35.03, CUDA Version 12.6) during graphics-heavy operations, particularly while running stress tests. My system specs are as follows:

  • GPU: NVIDIA GeForce RTX 4090
  • Driver Version: 560.35.03
  • CUDA Version: 12.6
  • OS: Ubuntu 22.04.3 (Kernel: 6.8.0-40-generic)
  • OpenGL Version: 4.6 NVIDIA 560.35.03

Problem: Whenever I run a stress test, or after a couple of hours of running glmark2, my GPU eventually crashes with the error message: “GPU has fallen off the bus”. Here’s a detailed breakdown of what happens:

  1. The stress test runs fine for a few moments, with the FPS reaching high values (1944-2169 FPS).
  2. Partway through, I get a segmentation fault, and the GPU seems to disconnect from the system.
  3. Running nvidia-smi after the crash results in the error: “Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error”.
  4. The kernel logs show several PCIe Bus Errors (severities: Correctable and Non-Fatal), as well as uncorrectable errors related to the Transaction Layer (Requester ID: 8086).
  5. The logs indicate that the NVIDIA kernel module is unloaded after the crash.
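
In case it helps with triage, the kernel-log symptoms in the steps above can be tallied with a small script. This is only a sketch, not an official diagnostic tool: the function names `classify_kernel_lines` and `read_kernel_log` are made up for illustration, the search strings are the messages quoted in this post plus the `NVRM: Xid` prefix the NVIDIA driver uses for its error reports, and it assumes a systemd system where `journalctl -k` can read the kernel log for the current boot.

```python
import re
import subprocess
from collections import Counter

# Substrings to look for. The first two are quoted from the symptoms above;
# "NVRM: Xid" is the prefix the NVIDIA kernel driver uses for Xid reports.
PATTERNS = {
    "fallen_off_bus": re.compile(r"GPU has fallen off the bus", re.IGNORECASE),
    "pcie_bus_error": re.compile(r"PCIe Bus Error", re.IGNORECASE),
    "nvrm_xid": re.compile(r"NVRM: Xid"),
}


def classify_kernel_lines(lines):
    """Count how many log lines match each pattern (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def read_kernel_log():
    """Return kernel messages for the current boot as a list of lines.

    Assumes journalctl is available; reading the log may need sudo or
    membership in a group that is allowed to view the journal.
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    counts = classify_kernel_lines(read_kernel_log())
    for name in PATTERNS:
        print(f"{name}: {counts[name]}")
```

If the machine was rebooted after the crash, the relevant messages are in the previous boot, so `-b` would need to become `-b -1` (this only works when the journal is persistent).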

I’ve already tried the following:

  • Reinstalling the drivers, both downgrading and upgrading them.
  • Checking the BIOS/UEFI settings related to PCIe and adjusting the PCIe slot speed.
  • Verifying the OpenGL configuration.
  • Running nvidia-bug-report.sh (log attached below).
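
For anyone comparing notes on the PCIe slot speed change, the negotiated link can be read back with `nvidia-smi` while the GPU is still responding. The following is a minimal sketch under a few assumptions: the helper names `parse_link_info` and `query_link_info` are hypothetical, GPU index 0 is assumed to be the RTX 4090, and the output is assumed to be the plain `csv,noheader` form (one comma-separated line per GPU).

```python
import subprocess

# nvidia-smi --query-gpu fields for the current and maximum PCIe link.
FIELDS = [
    "pcie.link.gen.current",
    "pcie.link.gen.max",
    "pcie.link.width.current",
    "pcie.link.width.max",
]


def parse_link_info(csv_line):
    """Turn one csv,noheader output line into a field -> value dict."""
    values = [value.strip() for value in csv_line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    return dict(zip(FIELDS, values))


def query_link_info(gpu_index=0):
    """Ask nvidia-smi for the PCIe link state of one GPU (index is an assumption)."""
    cmd = [
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_link_info(result.stdout.strip().splitlines()[0])


if __name__ == "__main__":
    for field, value in query_link_info().items():
        print(f"{field}: {value}")
```

The current link generation can read lower than the maximum at idle because of power management, so the values are probably most meaningful when sampled while the stress test is running. Once the GPU has dropped off the bus, the query is expected to fail just like a plain `nvidia-smi` call does.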

Despite these efforts, the issue persists. Any help or insights on how to resolve this would be greatly appreciated. I’m particularly interested in understanding whether this could be a hardware issue (PCIe interface) or a software/driver-related problem.

Looking forward to any suggestions or advice!

Thanks in advance!

nvidia-bug-report.log (2.4 MB)