Titan RTX memory access error/malfunction/bug

We have been training CNNs with PyTorch, and at some point the Titan RTX started throwing errors.

Errors happened on Ubuntu 19.10 with both:

  • Driver Version: 435.21 CUDA Version: 10.1
  • Driver Version: 440.59 CUDA Version: 10.2

Errors:

  • CUDNN_STATUS_EXECUTION_FAILED error from the cuDNN backend in PyTorch 1.4.0, run with CUDA_LAUNCH_BLOCKING=1 (and yes, the same code works with the CPU backend)
  • "Invalid memory access" error during memory stress test from https://github.com/ComputationalRadiationPhysics/cuda_memtest, see error.png
  • nvidia-bug-report.log.gz is attached.
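
For anyone who wants a quick cross-check without building cuda_memtest, here is a minimal sketch of a write/read-back pattern test in PyTorch. It is only an illustration, not the test that produced the errors above, and it is far less thorough than cuda_memtest: it touches only a small buffer, and PyTorch's allocator decides where that buffer lives. The names `run_patterns` and `gpu_roundtrip`, the bit patterns, and the buffer size are assumptions made for this example.

```python
# Bit patterns that fit in a signed 32-bit integer (illustrative choice).
PATTERNS = [0x00000000, 0x7FFFFFFF, 0x55555555, 0x2AAAAAAA]


def run_patterns(write_and_read_back, patterns=PATTERNS, n=1 << 20):
    """Write each pattern n times through the given round-trip function
    and return {pattern: number of values that came back different}."""
    results = {}
    for pattern in patterns:
        data = write_and_read_back([pattern] * n)
        results[pattern] = sum(1 for value in data if value != pattern)
    return results


def gpu_roundtrip(values, device="cuda:0"):
    """Copy the values into GPU memory and back to the host.
    Requires PyTorch built with CUDA support."""
    import torch  # imported here so the pure-Python part runs without PyTorch

    on_gpu = torch.tensor(values, dtype=torch.int32, device=device)
    torch.cuda.synchronize(device)  # make sure the copy has completed
    return on_gpu.cpu().tolist()


# On the affected machine one would call:
#     print(run_patterns(gpu_roundtrip))
# Any non-zero count means data read back from the GPU differed from what was written.
```

A clean result from a sketch like this would not rule out a defect, since only a small part of the 24 GB is exercised; a non-zero count or a CUDA error on a card that passes on other machines would point toward the hardware.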

The same code has been tested on other Windows and Linux machines and runs perfectly there.
Neither reinstalling the driver nor reinstalling Ubuntu solved the problem.

We assume the device has some memory defect. Any ideas on this?
error.png
nvidia-bug-report.log.gz (516 KB)