RTX 3090 GPU crashes (EDIT: power supply issue)

Initially, I experienced this as a full desktop GPU crash (i.e., GPU drops off the PCIe bus) in a Linux game, but it became apparent that the issue occurs with many Vulkan apps.

This machine is also used for CUDA development, and there seems to be no problem at all with CUDA or OpenGL. For example, the nbody CUDA example can be run in full screen or by resizing with no problem. It seems to be specific to Vulkan and related to the compositor. It also seems to be specific to Ubuntu 22.04.4; prior releases did not have stability problems with either CUDA or Vulkan.

EDIT: turned out to be an issue with the system power supply. The Vulkan driver will adjust the GPU clock rates according to the load, whereas CUDA will not. So a sudden transient load probably exacerbated the issue enough to cause the GPU to drop off the PCIe bus.

I have also attached the nvidia-installer.log. There is a warning that another driver installation method was detected, but per the steps above, there are no other NVIDIA or Nouveau drivers present, though the CUDA APT repository is still present.
nvidia-bug-report.log.gz (168.6 KB)
nvidia-installer.log (42.7 KB)

You’re getting an Xid 79, “GPU has fallen off the bus”. The most common reasons are overheating or lack of power. Monitor temperatures, reseat the power connectors and the card in its slot, and check or replace the PSU.
To check for power issues, you can use nvidia-smi -lgc to prevent boost situations, e.g.
nvidia-smi -lgc 300,1200
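
To confirm which Xid is being reported, the kernel log can be scanned for NVRM Xid lines. The following is only a minimal sketch, assuming the usual "NVRM: Xid (PCI:...): <code>, ..." line layout; the function name and the sample log line are illustrative and not taken from the attached bug report.

```python
import re

# Assumed layout of an Xid line in the kernel log, e.g.:
#   NVRM: Xid (PCI:0000:01:00): 79, pid=4321, GPU has fallen off the bus.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),?\s*(?P<detail>.*)"
)


def find_xid_events(log_text):
    """Return (pci_address, xid_code, detail) tuples for every Xid line found."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("code")), match.group("detail").strip())
            )
    return events


# Illustrative sample only (hypothetical timestamps and pid).
SAMPLE_LOG = """\
[ 1234.567890] NVRM: Xid (PCI:0000:01:00): 79, pid=4321, GPU has fallen off the bus.
[ 1234.600000] some unrelated kernel message
"""

if __name__ == "__main__":
    for pci, code, detail in find_xid_events(SAMPLE_LOG):
        print(pci, code, detail)
```

On a live system the text passed to find_xid_events could come from the output of "sudo dmesg" or "journalctl -k" instead of the sample string.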

Thank you!

I had actually already done a quick spot-check with “nvidia-smi -lgc 0,1395” and experienced the same issue, with temperatures always between 30 and 65 °C, depending on idle fan settings. I had also done a quick check with a second power supply, but experienced the same problem.
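
For anyone repeating this kind of spot-check, temperature, power draw and SM clock can be logged while the clocks are locked. This is a rough sketch, assuming nvidia-smi is on the PATH and that the query fields temperature.gpu, power.draw and clocks.sm are supported on the card; the function names and the one-second interval are my own choices. Polling at this rate will likely miss very short power transients, so a clean log does not rule out a power problem.

```python
import subprocess
import time

# nvidia-smi --query-gpu fields; availability may vary by GPU and driver.
FIELDS = ["temperature.gpu", "power.draw", "clocks.sm"]


def to_number(text):
    """Convert a CSV cell to float, or None for values such as '[N/A]'."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_sample(csv_line):
    """Parse one 'csv,noheader,nounits' line into a dict keyed by field name."""
    cells = [cell.strip() for cell in csv_line.split(",")]
    return dict(zip(FIELDS, (to_number(cell) for cell in cells)))


def read_gpu_sample(gpu_index=0):
    """Query one GPU through nvidia-smi and return the parsed sample."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sample(result.stdout.strip().splitlines()[0])


if __name__ == "__main__":
    # Print one sample per second until interrupted with Ctrl+C.
    while True:
        print(time.strftime("%H:%M:%S"), read_gpu_sample())
        time.sleep(1)
```

If the GPU drops off the bus while this is running, the nvidia-smi call is expected to fail, so the last printed line gives an approximate picture of the state just before the crash.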

Well, you were nevertheless correct. I was able to reproduce this on Windows with DirectX 12, so I’m pretty sure I can close this!

It seems to have been either a slightly loose power or ground pin in the system, or some issue with the motherboard. After reconnecting everything, the issue no longer occurs on Linux or Windows.
