Hi,
I have been using an RTX 3060 for a couple of years now. It started having some issues a while ago, and I probably did not pay enough attention to them. Lately I have had some crashes, mostly a frozen PC looping the last 2-3 seconds of audio/video (if any was playing).
Similar issues are:
This used to happen to me occasionally, but not often. I was doing more intense computations (DL-related CUDA training or inference), and a couple of crashes here and there seemed reasonable. In the last few days it became more frequent, and yesterday it happened twice without the GPU even being under much load (normal browsing). It’s hard to say whether this is due to updates, hardware damage, or some misconfiguration.
For now I have removed the GPU and am running the PC on the integrated Intel graphics (I see no crashes now).
What I tried:
- updating CUDA drivers
- monitoring temperature (it reached around 60-70 degrees and I never saw 90; it could have reached that in the last minute before a crash, but that seems unlikely)
- manually setting the fan speed (this seemed to help, but when I forgot to set it, after checking that the PC does increase the fan speed on its own as the temperature rises, it crashed). I cannot tell whether the crashes are related to temperature.
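To settle whether the temperature spikes in the last minute before a crash, the readings could be logged to disk continuously, so the last sample survives a freeze. Below is a minimal sketch in Python, assuming `nvidia-smi` is on the PATH and the card is attached. The file name `gpu_log.csv`, the one-second interval, and the function names are my own choices for illustration, not anything provided by the driver.

```python
import csv
import os
import subprocess
import time

# Fields passed to `nvidia-smi --query-gpu`; see `nvidia-smi --help-query-gpu`.
FIELDS = ["timestamp", "temperature.gpu", "fan.speed", "power.draw", "utilization.gpu"]


def parse_sample(line):
    """Turn one CSV line printed by nvidia-smi into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}: {line!r}")
    return dict(zip(FIELDS, values))


def read_sample():
    """Query the first GPU once (needs the NVIDIA driver and the card present)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])


def log_forever(path="gpu_log.csv", interval_s=1.0):
    """Append one sample per interval; path and interval are illustrative defaults."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        while True:
            writer.writerow(read_sample())
            f.flush()
            os.fsync(f.fileno())  # push the row to disk so it survives a hard freeze
            time.sleep(interval_s)
```

Calling `log_forever()` in a terminal and leaving it running until the next freeze should leave the final temperature, fan speed, and power draw in the last rows of the file.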
From looking around, it seems it could be a misalignment between the motherboard and the GPU, or some kind of overclocking? I bought a custom-built computer from an online shop (I did not build it myself).
I am running Ubuntu 20.04, and the GPU-Z software that is commonly recommended works only on Windows, so I wanted to ask what the best way would be to test the GPU for physical damage, or to check for a misconfiguration of the motherboard/GPU or similar.
I have now run nvidia-bug-report (after removing the GPU) to collect information about the installed drivers, but of course it does not find the GPU.
The specifics of my pc are:
- CPU: Intel 11th Gen Core i7-11700 (8 cores, 16 threads)
- RAM: 32GB DDR4 SDRAM
- Mobo: Gigabyte B560 HD3
- Storage: SSD + HDD
- OS: Ubuntu 20.04 with Kernel Linux 5.15.0-116-generic
My question would be: what could be the next step?
- use Windows to run GPU-Z?
- reattach the NVIDIA card and re-run nvidia-bug-report?
What else could I do to understand better where the real issue is?
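One thing I could check once the card is reattached is the kernel log of the boot that crashed, since the NVIDIA kernel driver reports GPU faults there as `NVRM: Xid` lines. Below is a rough Python sketch of that check, assuming the systemd journal is persistent so that `journalctl -k -b -1` can still show the previous boot. The function names are mine, and the regular expression only covers the common `Xid (PCI:...): <code>` form.

```python
import re
import subprocess

# Matches lines such as "NVRM: Xid (PCI:0000:01:00): 79, pid=..., ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)")


def find_xids(log_text):
    """Return a list of (pci_address, xid_code) pairs found in kernel log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(log_text)]


def previous_boot_kernel_log():
    """Kernel messages of the previous boot (requires a persistent journal)."""
    return subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
```

After a crash and reboot, `find_xids(previous_boot_kernel_log())` would list any Xid codes logged before the freeze. An empty list would not prove the card is healthy, because a hard freeze can happen before anything is written to the log.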
Thanks a lot in advance
nvidia-bug-report.log.gz (188.4 KB)
(This is from after the last crash, without the GPU attached, but it shows information about the installed drivers.)