I have been using an RTX 2070 for deep learning training. My configuration is:
OS: Ubuntu 18.04
PyTorch: 1.0.1-post2 with CUDA 10.0 support
Everything worked well at the beginning. However, last week my program suddenly crashed with an error message (which I did not record), and I have not been able to run it since. I thought reinstalling the OS and the driver would solve the problem, so I tried that. But after reinstalling everything, neither my program nor the sample examples provided by PyTorch (https://github.com/pytorch/examples, the MNIST one) would run.
dmesg gave me this error message:
NVRM: Xid (PCI:0000:01:00): 31, Ch 00000058, engmask 00000101, intr 00000000
and PyTorch gave me this error message:
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=265 error=77 : an illegal memory access was encountered
I also tried Ubuntu 16.04 and some other driver versions, such as 418; they all gave the same error messages.
Moreover, I ran the samples provided in NVIDIA_CUDA-10.0_Samples to try to narrow down the problem. Most of the samples ran well, but one of them caught my attention: 2_Graphics/Mandelbrot showed some artifacts. I uploaded the image here: https://drive.google.com/file/d/1_8VhR4eS4xHG_kOx4vtd8GKpfeSyKy6-/view.
I wonder whether the two issues are related, and whether they point to a hardware problem.
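In case it helps anyone compare logs across driver versions, here is a minimal sketch for pulling the Xid code and PCI address out of dmesg text. The regular expression is only an assumption based on the single NVRM line quoted above, and the names parse_xid and xid_events are hypothetical helpers, not part of any NVIDIA or PyTorch tool.

```python
import re

# Assumed layout, based only on the one NVRM line quoted in this post:
#   NVRM: Xid (PCI:0000:01:00): 31, Ch 00000058, engmask 00000101, intr 00000000
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9A-Fa-f:.]+)\): (?P<xid>\d+), (?P<detail>.*)"
)


def parse_xid(line):
    """Return a dict with the PCI address, Xid code and detail text, or None."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return {
        "pci": match.group("pci"),
        "xid": int(match.group("xid")),
        "detail": match.group("detail").strip(),
    }


def xid_events(text):
    """Collect every Xid event found in a block of dmesg output."""
    events = []
    for line in text.splitlines():
        event = parse_xid(line)
        if event is not None:
            events.append(event)
    return events


if __name__ == "__main__":
    # The line from this post, used here as sample input.
    sample = (
        "NVRM: Xid (PCI:0000:01:00): 31, Ch 00000058, "
        "engmask 00000101, intr 00000000"
    )
    for event in xid_events(sample):
        print(event["pci"], event["xid"], event["detail"])
```

To use it on a real log, the saved output of dmesg can be read into a string and passed to xid_events in place of the sample line.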
Thanks for helping.
nvidia-bug-report.log.gz (1 MB)