GPU crashing under moderate load. 3090 + Ubuntu 22.10

I’ve just built a new PC with a 3090, but when the GPU is put under any significant load, it often crashes.

Sometimes the screen freezes and/or goes black and becomes unresponsive, other times it freezes for a few seconds, then becomes really slow, then becomes unresponsive. Occasionally the screen will freeze but the computer will stay responsive enough for me to SSH into the machine and run a few commands.

One of the times that it froze, I was able to log in and run sudo nvidia-bug-report.sh, which hung:

nvidia-bug-report.log.gz (120.6 KB)

(I also ran sudo nvidia-bug-report.sh --safe-mode --extra-system-data which was able to complete, but I’m not allowed to link more than 1 file per post…)
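
For cases like this, where the machine stays reachable over SSH after a freeze, the kernel log can also be checked directly for messages from the NVIDIA driver, which tags its lines with "NVRM" and reports GPU errors as "Xid" messages. The following is only a minimal sketch, assuming a systemd-based install where journalctl is available; the helper name nvidia_kernel_lines is hypothetical and not part of any NVIDIA tool.

```python
import subprocess


def nvidia_kernel_lines(log_text):
    """Return the kernel log lines that carry the NVIDIA driver's 'NVRM' tag."""
    return [line for line in log_text.splitlines() if "NVRM" in line]


if __name__ == "__main__":
    # 'journalctl -k' prints kernel messages from the current boot.
    # Assumption: journalctl is installed and the user may read the journal.
    result = subprocess.run(
        ["journalctl", "-k", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,
    )
    for line in nvidia_kernel_lines(result.stdout):
        print(line)
```

If the machine had to be hard-reset, adding "-b", "-1" to the journalctl arguments selects the previous boot instead, provided the journal is stored persistently.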

I’ve run into this problem with all of the NVIDIA drivers available on Ubuntu 22.10, and I also ran into the issue on Ubuntu 22.04. I don’t think it is due to overheating, because I’ve had it crash when the GPU was below 60 °C and also when I set the power limit to only 250 watts.
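
For anyone who wants to check temperature and power readings right up to a crash, here is a rough sketch that polls nvidia-smi once per second and forces each reading to disk, so the last lines survive a hard freeze. The file name gpu_log.csv, the one-second interval, and the helper names are assumptions chosen for illustration; the query fields are standard nvidia-smi --query-gpu fields.

```python
import csv
import io
import os
import subprocess
import time

# Fields requested from nvidia-smi, in output column order.
FIELDS = ["timestamp", "temperature.gpu", "power.draw", "power.limit"]


def parse_readings(csv_text):
    """Parse headerless nvidia-smi CSV output into dicts keyed by FIELDS."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(zip(FIELDS, row)) for row in rows if row]


def query_gpu():
    """Run nvidia-smi once and return its raw CSV output (one line per GPU)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader",
    ]
    return subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=10
    ).stdout


if __name__ == "__main__":
    # Hypothetical log file name; append so earlier runs are kept.
    with open("gpu_log.csv", "a") as log:
        while True:
            text = query_gpu()
            log.write(text)
            log.flush()
            os.fsync(log.fileno())  # push the data to disk before a possible freeze
            for reading in parse_readings(text):
                print(reading["temperature.gpu"], reading["power.draw"])
            time.sleep(1)
```

Run it in a second SSH session while the load is applied; after a crash, the tail of gpu_log.csv shows the last temperature and power draw that were recorded.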

I’ve noticed that putting pressure on the card while it is running (e.g. with a slightly too large GPU stand) seems to change the behavior, so I thought it could be the PCIe slot. But the GPU is crashing even if I slot it in another PCIe slot on the motherboard.

Does anyone have any ideas what could be causing this? Is the GPU just bad?

Here is a link to the bug report run with sudo nvidia-bug-report.sh --safe-mode --extra-system-data since I couldn’t include it in the original post above:

nvidia-bug-report_safe-extra.log.gz (126.3 KB)

Hello,

I am having the same issue. Did you eventually manage to solve it?

@gpphlipot
Please share reliable reproduction steps and how often the issue occurs, so that I can first try to reproduce it locally, which will help with debugging.