I’m on an updated Kubuntu 20.04 system with an Nvidia 3090 graphics card. I have a recurring problem with this video card hanging randomly. The symptom used to be a black screen; here is the previous post regarding this:
Now, it simply freezes the current screen and plays the last 2-3 seconds of any sound in a loop. Everything freezes, including mouse and keyboard, and only a hard reset fixes the issue until it recurs at a later time.
Looking through the logs, I don’t see any error messages being logged anywhere, so I am at a loss as to what could be causing this problem. Could this be a hardware problem? Should I try to RMA the card? It has honestly been very problematic with this freezing/black screen issue.
Any input is greatly appreciated. Thank you.
My specs are:
Product Name : NVIDIA GeForce RTX 3090
Driver Version : 495.29.05
CUDA Version : 11.5
Memory : 32 GB RAM
SSD Drive : 2 TB, over 1 TB free
CPU : Intel Core i7 10700k
OS : Ubuntu 20.04
nvidia-bug-report_2-4-2021.tar.gz (263.5 KB)
I have set the memory clock profile to XMP 1 on my Asus motherboard, which bumps it up to 3600 MHz. These are the memory modules that I have in my system:
Then you’re heavily overclocking the memory controller of your CPU, which would explain the XID32 errors you had previously. Rather set the memory clock to something near the CPU’s stock 2933 MHz and check if the system runs stable.
Though the nvidia-smi output is a bit concerning. Please create a new nvidia-bug-report.log after reducing memory clocks.
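If it helps, here is a minimal sketch for checking whether Xid errors still show up after lowering the memory clock. It assumes the driver writes them to the kernel log as lines containing `NVRM: Xid (<device>): <number>`; the helper name `find_xids` and the sample line are made up for illustration and are not taken from your report.

```python
import re

# Assumed line shape: "NVRM: Xid (PCI:0000:01:00): 32, pid=..., ..."
XID_RE = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def find_xids(log_text):
    """Return a list of (device, xid_code) tuples found in kernel log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(log_text)]


# Hypothetical sample line, only to show the expected input shape.
SAMPLE = "[ 1234.567890] NVRM: Xid (PCI:0000:01:00): 32, pid=1001, Channel ID 00000008"

for device, code in find_xids(SAMPLE):
    print(f"{device}: Xid {code}")
```

You could feed it the output of `journalctl -k -b -1` (kernel messages from the previous boot), assuming the journal is kept across reboots on your system.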
I just rebooted, disabled XMP memory overclocking, and ran nvidia-bug-report again.
nvidia-bug-report.tar.gz (266.1 KB)
Also, I wanted to share the power draw when running a deep learning training job on the GPU. It goes from over 500 Watts to less than 300 Watts in a matter of seconds. After a few minutes, the whole system shuts off. It’s yet another issue I’ve been having; here is a video showing the power draw readings on my UPS:
From observation, DL training jobs are extreme workloads that produce power spikes in particular. The shutdown is initiated by the PSU, which is a common issue. It can be worked around by using nvidia-smi -lgc to limit clocks so the GPU doesn’t boost, or fixed by using a better PSU.
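As a rough sketch of the clock-limit workaround, the snippet below only builds and prints the `nvidia-smi` commands, so you can review them before running anything. The 210 and 1400 MHz values are placeholders I picked for illustration, not tuned numbers for your card, and the function names are made up. Locking clocks normally needs root, and `-rgc` undoes the lock.

```python
import shlex


def lock_gpu_clocks_cmd(min_mhz, max_mhz, gpu_index=0):
    """Build the nvidia-smi command that locks GPU clocks to [min_mhz, max_mhz]."""
    if not 0 < min_mhz <= max_mhz:
        raise ValueError("expected 0 < min_mhz <= max_mhz")
    return ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"{min_mhz},{max_mhz}"]


def reset_gpu_clocks_cmd(gpu_index=0):
    """Build the nvidia-smi command that removes a previous clock lock."""
    return ["nvidia-smi", "-i", str(gpu_index), "-rgc"]


# Placeholder clock range (assumption); lower the max until the shutdowns stop.
print("sudo", shlex.join(lock_gpu_clocks_cmd(210, 1400)))
print("sudo", shlex.join(reset_gpu_clocks_cmd()))
```

Printing instead of executing is deliberate here: you can paste the line into a terminal once the range looks sensible for your setup.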
nvidia-smi still doesn’t report any GPU usage or GPU clocks, so there’s definitely something wrong with the GPU, though I guess the crashes resulted from the memory clocks. Rather check if you can get an RMA.
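For watching whether usage and clocks get reported at all, something like the sketch below could be used to turn one line of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output into numbers. The field selection and helper names are my own choice, and the sample row is invented. I am assuming missing values appear as non-numeric placeholders such as `[N/A]`, so anything that does not parse as a number is treated as missing.

```python
QUERY_FIELDS = ["utilization.gpu", "clocks.sm", "clocks.mem", "power.draw"]


def query_cmd(fields=QUERY_FIELDS):
    """Build the nvidia-smi query command for the given fields."""
    return ["nvidia-smi", f"--query-gpu={','.join(fields)}",
            "--format=csv,noheader,nounits"]


def parse_row(line, fields=QUERY_FIELDS):
    """Map one CSV output line to {field: float or None}."""
    row = {}
    for name, raw in zip(fields, line.split(",")):
        try:
            row[name] = float(raw.strip())
        except ValueError:
            row[name] = None  # non-numeric placeholder, treated as "not reported"
    return row


# Invented sample row: usage and SM clock missing, memory clock and power present.
row = parse_row("[N/A], [N/A], 405, 31.50")
missing = [name for name, value in row.items() if value is None]
print("not reported:", missing)
```

If the same fields stay unreported across reboots and driver versions, that would be worth mentioning in the RMA request.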