Technical Issues with RTX 3060 (hardware, maybe clock?)

Hi,
I have been using an RTX 3060 for a couple of years now. It started having some issues a while ago, and I probably did not pay enough attention to them. Lately I had some crashes, mostly with the PC freezing while looping the last 2-3 seconds of audio/video (if any was playing).

Similar issues are:

This happened to me occasionally, but not too often. I was doing more intense computations (DL-related CUDA training or inference), so a couple of crashes here and there seemed reasonable. In the last few days it became more frequent, and yesterday it happened twice while the GPU was barely in use (normal browsing). It is hard to say whether this is due to updates, hardware damage, or some misconfiguration.

I have now removed the GPU and am running the PC on the integrated Intel graphics (no crashes so far).

What I tried:

  • updating the CUDA drivers
  • monitoring the temperature (it reached around 60-70 degrees and never 90, unless it spiked in the last minute before a crash, which seems unlikely)
  • manually setting the fan speed (this seemed to help; but once, after I had checked that the PC does raise the fan speed on its own as the temperature increases, I forgot to set it manually and it crashed). I cannot tell whether the crash is related to the temperature.
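
To make the monitoring above survive a freeze, the readings could be written to disk continuously. Below is a minimal Python sketch, assuming nvidia-smi is on the PATH and a single GPU; the file name gpu_log.csv, the helper names, and the choice of fields are my own assumptions, not something from the driver documentation.

```python
import os
import subprocess
import time

# Query fields for `nvidia-smi --query-gpu`; this selection is an assumption
# about what is useful here, not an exhaustive list.
FIELDS = ["timestamp", "temperature.gpu", "fan.speed",
          "power.draw", "clocks.sm", "utilization.gpu"]


def parse_sample(line):
    """Split one CSV line printed by nvidia-smi into a {field: value} dict."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}: {line!r}")
    return dict(zip(FIELDS, values))


def read_sample():
    """Query the GPU once and return the parsed sample for the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])


def log_forever(path="gpu_log.csv", interval=1.0):
    """Append one sample per interval (hypothetical file name).

    Each line is flushed and fsync'ed so the last readings before a hard
    freeze are still on disk after the reboot.
    """
    with open(path, "a") as f:
        while True:
            sample = read_sample()
            f.write(",".join(sample[k] for k in FIELDS) + "\n")
            f.flush()
            os.fsync(f.fileno())
            time.sleep(interval)
```

Calling log_forever() in a terminal before normal use and reading the tail of gpu_log.csv after the next crash should show whether temperature, power draw, or clocks changed in the final seconds.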

From looking around, I gather it could be a mismatch between the motherboard and the GPU, or some kind of overclocking. I bought a custom-built computer from an online shop (I did not configure it myself).
I am running Ubuntu 20.04, and the GPU-Z software that is usually recommended works only on Windows, so I wanted to ask what would be the best way to test the GPU for physical damage, or to test for a misconfiguration of the motherboard/GPU and similar.

I have now run nvidia-bug-report (after removing the GPU) to collect information about the installed drivers, but of course it does not find the GPU.
The specs of my PC are:

  • CPU: Intel 11th Gen Core i7-11700 (8 cores, 16 threads)
  • RAM: 32GB DDR4 SDRAM
  • Mobo: Gigabyte B560 HD3
  • Storage: SSD + HDD
  • OS: Ubuntu 20.04 with Kernel Linux 5.15.0-116-generic

My question would be: what could be the next step?

  • use Windows to run GPU-Z?
  • reattach the NVIDIA card and re-run nvidia-bug-report?
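
Another possible step, sketched here under assumptions, would be to search the kernel log of the boot that crashed for NVIDIA "Xid" messages, which the driver prints when it detects a GPU fault. The Python sketch below assumes the journal is persistent and readable by my user (otherwise journalctl needs sudo, or /var/log/kern.log could be read instead); the function names are hypothetical, and the example line in the comment only illustrates the message format and is not taken from my logs.

```python
import re
import subprocess

# Illustrative format of a driver fault line in the kernel log:
#   NVRM: Xid (PCI:0000:01:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((?P<bus>[^)]*)\): (?P<code>\d+)")


def find_xid_events(log_text):
    """Return (bus, code, full_line) for every NVIDIA Xid line in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group("bus"), int(match.group("code")), line.strip()))
    return events


def previous_boot_kernel_log():
    """Kernel messages of the previous boot (-b -1), i.e. the one that froze."""
    return subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
```

Running find_xid_events(previous_boot_kernel_log()) right after rebooting from a freeze would list any Xid codes; an empty list would suggest that the driver did not get to log a fault before the freeze.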

What else could I do to understand better where the real issue is?
Thanks a lot in advance

nvidia-bug-report.log.gz (188.4 KB)
(this is from after the last crash, without the GPU attached, but it shows info about the installed drivers)

I reattached the GPU and re-ran the bug report, in case this helps:
nvidia-bug-report-wGPU.log.gz (375.3 KB)

UPDATE:

I tested the GPU using FurMark and ran two 10-minute tests at low (1080p) and high (4K) resolution; everything went fine. It did not overheat (it peaked at 69 and then stayed around 65-67 degrees Celsius) and showed absolutely no issues.
I still do not understand where the crashes come from (of course, now that I have reported them, there were none today).

UPDATE 2 (last):

I tried using other software and everything was running fine, then it crashed again while browsing a page (inspecting a 3D model, but the heavy computation was already done). I kept nvidia-settings always on top; the temperature was stable at 61 °C and there was no warning sign of anything. So the cause might lie elsewhere (some background program, or a power shortage?).
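
One way to look into the power-shortage idea might be to read the driver's clock throttle reasons, which nvidia-smi reports as a hexadecimal bitmask. The Python sketch below decodes that mask. The bit values are as I recall them from NVML's nvml.h and should be treated as assumptions to verify against the header of the installed driver; the function names are hypothetical.

```python
import subprocess

# Bit values recalled from nvml.h (nvmlClocksThrottleReason* constants).
# ASSUMPTION: verify them against the header shipped with your driver.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


def decode_throttle_mask(mask_text):
    """Turn a hex mask such as '0x0000000000000004' into a list of reason names."""
    mask = int(mask_text.strip(), 16)
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


def current_throttle_reasons():
    """Ask nvidia-smi for the active throttle mask of the first GPU and decode it."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=clocks_throttle_reasons.active",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return decode_throttle_mask(out.splitlines()[0])
```

If current_throttle_reasons() reported hw_slowdown or hw_power_brake_slowdown while the temperature stays in the 60s, that would point toward the power supply rather than cooling, though this is only a hint and not a diagnosis.
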
I am still clueless.
Thanks in advance