Hi NVIDIA community,
I’ve been facing consistent issues with my system involving an RTX 4090 (Driver 560.35.03, CUDA Version 12.6) during graphics-heavy operations, particularly while running stress tests. My system specs are as follows:
- GPU: NVIDIA GeForce RTX 4090
- Driver Version: 560.35.03
- CUDA Version: 12.6
- OS: Ubuntu 22.04.3 (Kernel: 6.8.0-40-generic)
- OpenGL Version: 4.6 NVIDIA 560.35.03
Problem: Whenever I run a stress test, or after a couple of hours of running glmark2, my GPU eventually crashes with the error message: “GPU has fallen off the bus”. Here’s a detailed breakdown of what happens:
- The stress test runs fine for a few moments, with the FPS reaching high values (1944-2169 FPS).
- Partway through, I get a segmentation fault, and the GPU seems to disconnect from the system.
- Running nvidia-smi after the crash results in the error: “Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error”.
- The kernel logs show several PCIe Bus Errors (severity: Correctable and Non-Fatal), and uncorrectable errors related to the Transaction Layer (Requester ID: 8086).
- The logs indicate that the NVIDIA kernel module is unloaded after the crash.
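For anyone who wants to look for the same pattern in their own logs, here is a minimal sketch of how the relevant lines could be filtered out of a saved kernel log. The pattern strings come from the messages quoted above, plus the "NVRM: Xid" prefix that the NVIDIA kernel driver uses for its Xid reports. The names `PATTERNS`, `classify`, and `scan` are my own and purely illustrative, and the script assumes the log was first saved to a file (for example with `sudo dmesg > kernel.log`).

```python
import re
import sys
from collections import Counter

# Substrings taken from the messages quoted in this post, plus "NVRM: Xid",
# the prefix the NVIDIA kernel driver uses for its Xid error reports.
# The dictionary keys are illustrative labels, not official names.
PATTERNS = {
    "fallen_off_bus": re.compile(r"GPU has fallen off the bus", re.IGNORECASE),
    "pcie_bus_error": re.compile(r"PCIe Bus Error", re.IGNORECASE),
    "xid": re.compile(r"NVRM: Xid"),
}


def classify(line):
    """Return the labels of all patterns that match a single log line."""
    return [name for name, rx in PATTERNS.items() if rx.search(line)]


def scan(lines):
    """Return (count per label, matching lines in their original order)."""
    counts = Counter()
    hits = []
    for line in lines:
        names = classify(line)
        if names:
            counts.update(names)
            hits.append(line.rstrip("\n"))
    return counts, hits


# Only runs when a log file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as fh:
        counts, hits = scan(fh)
    for hit in hits:
        print(hit)
    print(dict(counts))
```

Saved under a name of your choice and run as `python3 <script> kernel.log`, it prints the matching lines followed by a count per category, which makes it easier to see whether the PCIe errors start before or after the GPU drops off the bus.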
I’ve already tried the following:
- Reinstalling the drivers, including both downgrading and upgrading them.
- Checking for BIOS/UEFI settings related to PCIe and adjusting the PCIe slot speed.
- Verifying the OpenGL configuration.
- Running nvidia-bug-report.sh (output attached below).
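Regarding the PCIe slot speed adjustment, one way to confirm what the link actually negotiated is to compare the LnkCap and LnkSta lines in the output of `sudo lspci -vv -s 01:00.0`. Below is a rough sketch of that comparison, assuming the usual lspci layout ("Speed ...GT/s, Width x..."). The helper names `parse_link` and `link_is_degraded` are hypothetical. As far as I understand, the GPU can lower its link speed at idle to save power, so a lower LnkSta is only meaningful when read under load.

```python
import re

# Matches the "LnkCap:" and "LnkSta:" lines of `lspci -vv` output.
# "LnkCap2:" / "LnkSta2:" are skipped because the colon must follow directly.
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_link(lspci_text):
    """Map 'LnkCap'/'LnkSta' to (speed in GT/s, lane width) for one device."""
    result = {}
    for line in lspci_text.splitlines():
        m = LINK_RE.search(line)
        # Keep only the first occurrence of each field.
        if m and m.group(1) not in result:
            result[m.group(1)] = (float(m.group(2)), int(m.group(3)))
    return result


def link_is_degraded(link):
    """True if the negotiated link is slower or narrower than the capability.

    Returns None when either field is missing from the parsed output.
    """
    cap, sta = link.get("LnkCap"), link.get("LnkSta")
    if cap is None or sta is None:
        return None
    return sta[0] < cap[0] or sta[1] < cap[1]
```

Feeding the text of the lspci output for the GPU into `parse_link` and passing the result to `link_is_degraded` gives a quick yes/no on whether the slot is running below what the card advertises.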
Despite these efforts, the issue persists. Any help or insights on how to resolve this would be greatly appreciated. I’m particularly interested in understanding whether this could be a hardware issue (PCIe interface) or a software/driver-related problem.
Looking forward to any suggestions or advice!
Thanks in advance!