Titan RTX failing when using 7~8 GB or higher memory

Pasted here the questionaire provided by nVidia Support team, already answered.

1. How are you determining that the card is failing when using 7~8 GB or higher memory ? Are you using any specific tool, etc., to monitor the amount of memory used when it crashes ?

We noticed that the card started failing (in particular, returning NaN values in PyTorch) when the workload exceeded the amount of 7-8GB of memory usage. We verified the amount of memory being used by the GPU via nvidia-smi.

2. Have you tried reinstalling the driver for the graphics card ? If yes, then did you perform the Express or the Custom install for the drivers? Also did you perform a clean install ?

We performed a clean install and the current driver installed is: Driver Version: 465.19.01.

3. Do you notice the crash only while in specific game / application, etc., ? Please mention the games / applications in which you notice the issue

Currently, we have only tested the GPU with PyTorch. However, since it only fails when reaching a certain memory threshold, we thought the issue would be on the graphics memory. To verify this, we downloaded & compiled cuda_memtest (github.com/ComputationalRadiationPhysics/cuda_memtest). The output of cuda_memtest verified our suspicions, almost immediately detecting multiple errors on the block 6784. We provide full logs at the end of the question. Please note that the system currently has also installed an RTX 2070 SUPER and that unit does not have any of the issues that the RTX TITAN exhibits.

4. What is the Wattage of the power supply in the computer in Watts?

The power supply is a Nox Hummer GD850 850W 80 Plus Gold.

These graphic cards need additional power connectors to be plugged in with direct cables from the Power Supply, could you confirm if these cables are plugged in securely ?

Yes.

5. Have you installed the latest updates for the operating system? Have you updated the latest motherboard chipset drivers and the BIOS ?

Yes. We did a complete reconfiguration when we started to experience problems.