Hi,
My setup is Ubuntu 24.04.4 LTS (Linux kernel 6.17.0-14-generic) with an NVIDIA GeForce RTX 4090 connected to a monitor via DP and an NVIDIA GeForce RTX 5090 D used for compute only. There is no bridge (data cable) between them, as the computations run independently.
For over 6 months the system was stable and ran more or less 24/7 with CUDA C computations without any problems. Suddenly, 2 weeks ago, the system started crashing up to 3 times a day, but then there was a period of 12 days with no crash and then 3 crashes in one day. The kernel reports "GPU has fallen off the bus", which gives a black screen in Xorg (2:21.1.12-1ubuntu1.5 amd64). The system still runs, and I can ssh into it and run nvidia-bug-report.sh, which hangs. It suggests using the arguments --safe-mode and --extra-system-data, which I appended.
The crashes also happen in the idle state (P8); does this make a power issue less likely?
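To check whether the crashes really correlate with performance state or power draw, the readings just before each crash could be logged. The following is only a minimal sketch: it assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output. The function names, the log file name, and the 5-second interval are my own illustrative choices.

```python
import csv
import io
import subprocess
import time

# Fields requested from nvidia-smi's --query-gpu interface.
FIELDS = ["timestamp", "index", "pci.bus_id", "pstate", "power.draw", "temperature.gpu"]


def build_query_command(fields=FIELDS):
    """Build the nvidia-smi command line that prints one CSV row per GPU."""
    return ["nvidia-smi", "--query-gpu=" + ",".join(fields), "--format=csv,noheader,nounits"]


def parse_rows(text, fields=FIELDS):
    """Turn nvidia-smi CSV output into a list of dicts keyed by field name."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(fields, (cell.strip() for cell in row))) for row in reader if row]


def log_forever(path="gpu_state.log", interval_s=5.0):
    """Append one line per GPU every interval_s seconds (hypothetical helper).

    A timeout is used because nvidia-smi may hang once a GPU has dropped
    off the bus; the last lines written before that point are the ones of interest.
    """
    while True:
        try:
            out = subprocess.run(build_query_command(), capture_output=True,
                                 text=True, timeout=10, check=True).stdout
            rows = parse_rows(out)
        except (subprocess.SubprocessError, OSError) as exc:
            rows = [{"error": repr(exc)}]
        with open(path, "a") as fh:
            for row in rows:
                fh.write(repr(row) + "\n")
        time.sleep(interval_s)
```

Calling `log_forever()` from an ssh session (or a systemd unit) would leave a record of the last known P-state, power draw, and temperature of both cards before each crash.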
The problem happens with both Linux kernel 6.17 and 6.14, and with both NVIDIA driver nvidia-driver-580-open 580.126.09-0ubuntu0.24.04.1 amd64 and nvidia-driver-590-open 590.48.01-0ubuntu0.24.04.1 amd64.
I also tried turning off Resizable BAR in the BIOS, but that made no difference.
I tried different combinations of HDMI and DP cables, as well as output from the other card, but the system still crashes.
When the screen output is from the 5090, the system hangs, and I cannot run the bug report script.
Finally, it is presumably always the 4090 that crashes, since its fan goes to max speed, Xorg is busy at 100% CPU and nvtop still sees the RTX 5090, although the kernel crash also kills the computations on the 5090.
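To confirm which card actually drops off the bus rather than inferring it from the fan and nvtop, the PCI address in the kernel messages could be extracted. The sketch below assumes the usual shape of NVRM kernel log lines (`NVRM: Xid (PCI:...): <code>, ...` and `NVRM: GPU <address>: GPU has fallen off the bus`). The regular expressions and the function name are my own, and the sample lines in the comment are illustrative, not taken from my log.

```python
import re

# Assumed NVRM message shapes; adjust if the real log lines differ.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")
OFF_BUS_RE = re.compile(r"NVRM: GPU (?P<pci>[0-9a-fA-F:.]+): GPU has fallen off the bus")


def find_gpu_events(lines):
    """Return (pci_address, description) tuples for NVRM error lines.

    Illustrative input lines (not real data):
      NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
      NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
    """
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append((match.group("pci"), "Xid " + match.group("xid")))
            continue
        match = OFF_BUS_RE.search(line)
        if match:
            events.append((match.group("pci"), "fallen off the bus"))
    return events
```

The input could be the saved output of `journalctl -k -b -1` (kernel messages from the previous boot, if the journal is persistent), and the reported PCI address can then be matched against `lspci` to see whether it belongs to the 4090 or the 5090.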
The machine specs are:
NVIDIA GeForce RTX 4090 (DP monitor output)
NVIDIA GeForce RTX 5090 D (no output)
Mainboard: ASUS PRIME Z790-P
CPU: Intel® Core™ i7-14700KF × 28
Memory: 64.0 GiB Asgaard at both 5500 and 4200 MHz (no difference)
Xorg 2:21.1.12-1
64-bit Ubuntu 24.04.4 LTS
Can I determine whether this is a power issue or a driver issue or something else?
I attached the bug report, which has more info.
nvidia-bug-report.log.gz (224.3 KB)
nvidia-bug-report.log.old.gz (153.6 KB)
Any help is highly appreciated!
Sincerely,
/sbgudnason