A4000 crashes constantly

Hi!

I bought a used A4000 card 2 months ago. According to the seller it was fully operational, and he gave me a 3-month warranty on it. I have had major problems with it from the beginning; from my point of view it is defective.
At first the fan was not communicating with the controller, so the card ran with the fan at 100% speed all the time. The seller fixed this, but then the main problem appeared:

constant crashes of the card under load.

When the seller sent the card back to me after fixing the fan problem, it started to show a lot of artifacts during testing (followed by a crash). I sent it back, but the seller said he had tested the card for two days and had not noticed any problems.

I do notice them.

When the card is cold and I run e.g. FurMark or Cinebench, the card works for 5-10 minutes until the CPU temperature rises to about 85 degrees and the memory to about 90 degrees. Then the graphics driver crashes: artifacts appear, the driver restarts, and very often no further work is possible (glitches in the interface, black windows). Sometimes after a reboot the computer hangs before showing the login window, and only another reboot helps.

Subsequent load tests cause a fast crash (after 1-2 minutes, with the same symptoms as above). After the first crash, the system sometimes also crashes at completely random moments (e.g. during video streaming).

After some crashes (especially when I used the 2021 drivers installed automatically by Windows), the system would show the normal desktop and display a message window: Unable to recover from a kernel exception. The application must close. Error code: 3 (subcode 2).

The computer usually (but not always) works normally when not under load, or when under load without reaching high temperatures (normal system operation, use of 2D graphics applications). I also tested the system with a 3D TPP game - sooner or later a crash would occur (artifacts, reboot).

I have tested the card on 2 different computers (tomorrow I will test on a third, but I don't expect different results). I used various drivers (uninstalling each with DDU): the latest and older ones from NVIDIA, the latest and older ones from the card supplier (HP), and the drivers suggested by Microsoft. I also tried a clean Windows installation. I have disconnected the second monitor, changed the DP cables, and changed the PCIe slot. I have forced the high-performance profile and constant v-sync.

The card is installed in a dual Xeon E5-2650L system (X99 board, PCIe 3.0 x16, 128 GB ECC memory) running Windows 10 Professional (but I have also tested it in a single-processor PC, with the same results). The PSU is a 750 W SilentiumPC.

The system is completely stable with a GTX 650.

System events show the following errors in nvlddmkm:
\Device\Video3
2026eb7e 2026ecb6 2021e2b2 2026d800 2021a5be 00000000 00000000 00000000

\Device\Video3
UCodeReset TDR occurred on GPUID:8400

\Device\Video3
Resetting TDR occurred on GPUID:8400

\Device\00000096
Error occurred on GPUID: 8400
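
The same entries could be pulled from the System log in bulk instead of being copied by hand. This is only a minimal sketch, assuming the built-in Windows tool `wevtutil` is available and that the events are logged under the provider name `nvlddmkm`. The helper names `build_event_query`, `dump_events` and `count_tdr_events` are hypothetical, made up for illustration:

```python
import re
import subprocess
from collections import Counter


def build_event_query(provider="nvlddmkm", max_events=50):
    """Build a wevtutil command that lists the newest events of one provider."""
    xpath = f"*[System[Provider[@Name='{provider}']]]"
    return [
        "wevtutil", "qe", "System",  # query the System event log
        f"/q:{xpath}",               # XPath filter on the provider name
        f"/c:{max_events}",          # limit the number of events returned
        "/rd:true",                  # newest events first
        "/f:text",                   # plain-text output
    ]


def dump_events(provider="nvlddmkm", max_events=50):
    """Run the query and return the raw text (Windows only)."""
    result = subprocess.run(
        build_event_query(provider, max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def count_tdr_events(text):
    """Count messages of the form '<Kind> TDR occurred on GPUID:...' by kind."""
    kinds = re.findall(r"(\w+) TDR occurred on GPUID", text)
    return dict(Counter(kinds))
```

Running `count_tdr_events(dump_events())` after each crash might show whether the mix of `UCodeReset` and `Resetting` entries is the same on every machine.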

NVIDIA-SMI, polled at a 5-second interval, shows the following around a crash during the Cinebench test:

Sat Dec 7 20:12:51 2024
+----------------------------------------------------------------------------------------+
| NVIDIA-SMI 553.35 Driver Version: 553.35 CUDA Version: 12.4 |
|-----------------------------------------+-----------------------+---------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A4000 WDDM | 00000000:84:00.0 On | Off |
| 45% 68C P2 137W / 140W | 14846MiB / 16376MiB | 87% Default |
| | | N/A |
+----------------------------------------+-----------------------+---------------------+

Sat Dec 7 20:12:56 2024
+----------------------------------------------------------------------------------------+
| NVIDIA-SMI 553.35 Driver Version: 553.35 CUDA Version: 12.4 |
|-----------------------------------------+-----------------------+---------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A4000 WDDM | 00000000:84:00.0 On | Off |
| 48% 66C P2 58W / 140W | 14292MiB / 16376MiB | 1% Default |
| | | N/A |
+----------------------------------------+-----------------------+---------------------+

Sat Dec 7 20:13:01 2024
+----------------------------------------------------------------------------------------+
| NVIDIA-SMI 553.35 Driver Version: 553.35 CUDA Version: 12.4 |
|-----------------------------------------+-----------------------+---------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A4000 WDDM | 00000000:84:00.0ERR! | ERR! |
|ERR! 60C ERR! ERR! / ERR! | 14169MiB / 16376MiB | ERR! ERR! |
| | | N/A |
+----------------------------------------+-----------------------+---------------------+

etc.
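
A machine-readable version of this polling might make it easier to line up the last good reading with the moment of the crash. What follows is a minimal sketch, assuming `nvidia-smi` is on the PATH and that the driver supports the query fields listed in `FIELDS` (they can be checked with `nvidia-smi --help-query-gpu`). The function names and the log file name `gpu_log.csv` are hypothetical. I am not sure what text the CSV mode prints for a failed reading, so any non-numeric value is simply treated as unreadable:

```python
import csv
import io
import subprocess

# Fields to log; assumed to be supported by the installed driver.
FIELDS = ["timestamp", "fan.speed", "temperature.gpu",
          "power.draw", "utilization.gpu", "memory.used"]


def build_command(interval_s=5):
    """nvidia-smi command that prints one CSV row every interval_s seconds."""
    return ["nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
            "-l", str(interval_s)]


def parse_row(line):
    """Turn one CSV row into a dict; values that are not numbers become None."""
    values = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    row = {}
    for name, raw in zip(FIELDS, values):
        raw = raw.strip()
        if name == "timestamp":
            row[name] = raw
            continue
        try:
            row[name] = float(raw)
        except ValueError:  # assumption: any error marker is non-numeric
            row[name] = None
    return row


def first_failure(rows):
    """Return (last fully readable row, first row with an unreadable value)."""
    last_good = None
    for row in rows:
        if any(v is None for k, v in row.items() if k != "timestamp"):
            return last_good, row
        last_good = row
    return last_good, None


def log_gpu(path="gpu_log.csv", interval_s=5):
    """Append rows to a file, flushing each one so the log survives a hang."""
    proc = subprocess.Popen(build_command(interval_s),
                            stdout=subprocess.PIPE, text=True)
    with open(path, "a") as out:
        for line in proc.stdout:
            out.write(line)
            out.flush()
    proc.wait()
```

Starting `log_gpu()` in a separate console before the benchmark, then feeding the saved lines through `parse_row` and `first_failure`, should give the temperature and power draw of the last sample before the readings fail.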

What else can I do to diagnose the card? What could be the cause of the failure? The seller claims to have tested the card and says the errors did not appear for him. Indeed, the card is sometimes able to pass the 30-minute Cinebench 2024 stability test immediately after the first power-up when cool, but then it immediately crashes on every other task (only when it cools down to below 35 degrees does it become stable enough to run again for several minutes under load). I am very tired of this situation, and I don't know what else I could do to locate the problem.

Small update: on a third system, which runs an RTX 3070 Ti successfully on a daily basis, the A4000 card crashed Cinebench after 2 minutes of running the test. It displayed a window with the message ‘Unable to recover (…) Error code: 3 (subcode 2)’, and the system event log shows numerous errors related to nvlddmkm.