We have been training CNNs with PyTorch, and at some point the Titan RTX started producing errors.
The errors occurred on Ubuntu 19.10 with both of the following driver versions:
- Driver Version: 435.21 CUDA Version: 10.1
- Driver Version: 440.59 CUDA Version: 10.2
Symptoms and logs:
- CUDNN_STATUS_EXECUTION_FAILED error from the cuDNN backend in PyTorch 1.4.0, with the job started with CUDA_LAUNCH_BLOCKING=1 (and yes, the same code works with the CPU backend)
- "Invalid memory access" error during the memory stress test from https://github.com/ComputationalRadiationPhysics/cuda_memtest, see error.png
- nvidia-bug-report.log.gz is attached.
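
For anyone who wants to cross-check a card from inside PyTorch, here is a minimal sketch of a pattern write/read-back test. It is not what cuda_memtest does internally and is far cruder; the function name `check_gpu_memory` and its parameters are made up for illustration, and the chunk sizes are assumptions. A clean result from this does not prove the memory is healthy.

```python
import os

# Set before CUDA is initialised so kernel launches run synchronously and
# an error is reported at the call that caused it.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def check_gpu_memory(chunk_mib=256, max_chunks=96):
    """Fill GPU memory with known byte patterns and read them back.

    Hypothetical helper, not part of PyTorch or cuda_memtest.
    Returns a list of (chunk_index, pattern, mismatching_bytes) tuples,
    or None if PyTorch or a CUDA device is not available.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    n_bytes = chunk_mib * 1024 * 1024
    chunks = []    # keep every chunk allocated so later ones land elsewhere
    failures = []
    try:
        for i in range(max_chunks):
            buf = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
            chunks.append(buf)
            for pattern in (0x00, 0xFF, 0xAA, 0x55):
                buf.fill_(pattern)
                # Count bytes that do not read back as the written pattern
                bad = int((buf != pattern).sum().item())
                if bad:
                    failures.append((i, pattern, bad))
    except RuntimeError as exc:
        # Running out of memory simply ends the sweep; any other CUDA
        # error is re-raised because it is itself a finding.
        if "out of memory" not in str(exc):
            raise
    return failures
```

Calling `check_gpu_memory()` returns an empty list if every pattern read back correctly, a list of mismatches otherwise, and `None` when no CUDA device is visible. A CUDA error raised in the middle of the sweep would point in the same direction as the cuda_memtest result above.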
The same code has been tested on other Windows and Linux machines and runs perfectly there.
Neither reinstalling the driver nor reinstalling Ubuntu solved the problem.
We suspect the card has a memory defect. Any ideas on this?
nvidia-bug-report.log.gz (516 KB)