RTX 2070 CUDA problem? Cannot run PyTorch anymore after a program crashed; the Mandelbrot sample shows artifacts.


I have been using RTX 2070 for deep learning training. My configuration is:

OS: Ubuntu 18.04
Driver: 410.78
CUDA: 10.0
PyTorch: 1.0.1.post2 with CUDA 10.0 support

Everything worked well at the beginning. However, last week my program suddenly crashed with an error message (which I did not record), and I could not run it anymore. I thought reinstalling the OS and the driver might solve the problem, so I tried that. But after reinstalling everything, neither my program nor the sample examples provided by PyTorch (https://github.com/pytorch/examples, the MNIST one) would run. I found the following error messages:

dmesg gave me this error message:

NVRM: Xid (PCI:0000:01:00): 31, Ch 00000058, engmask 00000101, intr 00000000

and PyTorch gave me this error message:

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=265 error=77 : an illegal memory access was encountered

I also tried Ubuntu 16.04 and some other driver versions such as 418; they gave the same error message.

Moreover, I ran the samples provided in NVIDIA_CUDA-10.0_Samples to narrow down the problem. Most of the samples ran well, but one of them caught my attention: 2_Graphics/Mandelbrot showed some artifacts. I uploaded the image here: https://drive.google.com/file/d/1_8VhR4eS4xHG_kOx4vtd8GKpfeSyKy6-/view.

I wonder whether the two issues are related and whether this points to a hardware problem.

Thanks for helping.

nvidia-bug-report.log.gz (1 MB)

Please use cuda-memtest to check your video memory.

cuda-memtest seems okay. After running it, I got:

[04/24/2019 15:20:10][air540][0]:Running cuda memtest, version 1.2.2
[04/24/2019 15:20:10][air540][0]:Warning: Getting serial number failed
[04/24/2019 15:20:10][air540][0]:NVRM version: NVIDIA UNIX x86_64 Kernel Module  410.78  Sat Nov 10 22:09:04 CST 2018
[04/24/2019 15:20:10][air540][0]:num_gpus=1
[04/24/2019 15:20:10][air540][0]:Device name=GeForce RTX 2070, global memory size=8338604032
[04/24/2019 15:20:10][air540][0]:major=7, minor=5
[04/24/2019 15:20:10][air540][0]:Attached to device 0 successfully.
[04/24/2019 15:20:10][air540][0]:Allocated 7512 MB
[04/24/2019 15:20:10][air540][0]:Test0 [Walking 1 bit]
[04/24/2019 15:20:13][air540][0]:Test0 finished in 3.4 seconds
[04/24/2019 15:20:13][air540][0]:Test1 [Own address test]
[04/24/2019 15:20:15][air540][0]:Test1 finished in 1.6 seconds
[04/24/2019 15:20:15][air540][0]:Test2 [Moving inversions, ones&zeros]
[04/24/2019 15:20:23][air540][0]:Test2 finished in 8.3 seconds
[04/24/2019 15:20:23][air540][0]:Test3 [Moving inversions, 8 bit pat]
[04/24/2019 15:20:31][air540][0]:Test3 finished in 8.4 seconds
[04/24/2019 15:20:31][air540][0]:Test4 [Moving inversions, random pattern]
[04/24/2019 15:20:35][air540][0]:Test4 finished in 4.2 seconds
[04/24/2019 15:20:35][air540][0]:Test5 [Block move, 64 moves]
[04/24/2019 15:20:38][air540][0]:Test5 finished in 2.1 seconds
[04/24/2019 15:20:38][air540][0]:Test6 [Moving inversions, 32 bit pat]
[04/24/2019 15:25:45][air540][0]:Test6 finished in 307.2 seconds
[04/24/2019 15:25:45][air540][0]:Test7 [Random number sequence]
[04/24/2019 15:25:52][air540][0]:Test7 finished in 6.8 seconds
[04/24/2019 15:25:52][air540][0]:Test8 [Modulo 20, random pattern]
[04/24/2019 15:25:52][air540][0]:test8[mod test]: p1=0x77fbd7b1, p2=0x8804284e
[04/24/2019 15:26:07][air540][0]:Test8 finished in 15.2 seconds
[04/24/2019 15:26:07][air540][0]:Test10 [Memory stress test]
[04/24/2019 15:26:07][air540][0]:Test10 with pattern=0x21e9f30f6386f61b
[04/24/2019 15:26:28][air540][0]:Test10 finished in 20.8 seconds

And when I ran sanity_check.sh, I got:

[04/24/2019 15:36:25][air540][0]:Running cuda memtest, version 1.2.2
[04/24/2019 15:36:25][air540][0]:Warning: Getting serial number failed
[04/24/2019 15:36:25][air540][0]:NVRM version: NVIDIA UNIX x86_64 Kernel Module  410.78  Sat Nov 10 22:09:04 CST 2018
[04/24/2019 15:36:25][air540][0]:num_gpus=1
[04/24/2019 15:36:25][air540][0]:Device name=GeForce RTX 2070, global memory size=8338604032
[04/24/2019 15:36:25][air540][0]:major=7, minor=5
[04/24/2019 15:36:25][air540][0]:Attached to device 0 successfully.
[04/24/2019 15:36:25][air540][0]:Allocated 7497 MB
[04/24/2019 15:36:25][air540][0]:Test10 [Memory stress test]
[04/24/2019 15:36:25][air540][0]:Test10 with pattern=0x2d7c16981e645408
[04/24/2019 15:36:28][air540][0]:Test10 finished in 2.6 seconds
main thread: Program exits

Looks good.
Maybe some thermal fault, you could run

  • gpu-burn for 600s
  • some unigine demo

to see if any issues show up once the system heats up.
Might also be system-memory related; maybe check if removing/changing memory modules sheds some light on this.

I tried gpu-burn and there was no problem.

Then I tried lowering the system memory frequency in my BIOS, and I found that things got better: my PyTorch code could run for several iterations, but it terminated after a while, and dmesg gave me the following error again:

Xid (PCI:0000:01:00): 31, Ch 00000018, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f91_2b206000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

Btw, I updated my driver to 430.09, which is why a more detailed error message is now shown.

I have found something interesting. The Mandelbrot error does indeed seem related to my PyTorch error.

If I restart my computer or wait for a while after getting an Xid 31 error, the PyTorch program works again. At this point, if I run the Mandelbrot program, it renders correctly without any artifacts.

Unfortunately, from this healthy state, if I start the PyTorch program again, after several epochs I get another Xid 31 and the GPU goes into the ill state again. From then on, PyTorch gives me Xid 31 immediately after starting, and Mandelbrot shows the same artifacts as before.

Since the Mandelbrot sample is simpler than the PyTorch source code, I dug into it to find out what is going on. Because the Mandelbrot program develops artifacts gradually, I traced the problem to the anti-aliasing part of the following function in Mandelbrot_cuda.cu.

template<class T>
__global__ void Mandelbrot0(uchar4 *dst, const int imageW, const int imageH, const int crunch, const T xOff, const T yOff,
                            const T xJP, const T yJP, const T scale, const uchar4 colors, const int frame,
                            const int animationFrame, const int gridWidth, const int numBlocks, const bool isJ)
{
    // omitted...
    if (frame == 0)  // the first frame is okay; if I make this condition always true, there are no artifacts
    {
        color.w = 0;
        dst[pixel] = color;
    }
    else // the calculation error happens here (after several anti-aliasing frames): some blocks/threads give the wrong result
    {
        int frame1 = frame + 1;
        int frame2 = frame1 / 2;
        dst[pixel].x = (dst[pixel].x * frame + color.x + frame2) / frame1; // may give wrong result
        dst[pixel].y = (dst[pixel].y * frame + color.y + frame2) / frame1; // may give wrong result
        dst[pixel].z = (dst[pixel].z * frame + color.z + frame2) / frame1; // may give wrong result
    }
    // omitted...
} // Mandelbrot0

After finding this, I wrote a tiny CUDA program to reproduce the situation:

#include <stdio.h>

__global__ void f(int t)
{
    int a = 2;
    a = (a * (t + 1)) / (t + 1);
    printf("%d @ block%d,thread%d with t=%d\n", a, blockIdx.x, threadIdx.x, t);
}

int main(void)
{
    for (int t = 0; t < 10; ++t)
        f<<<1, 32>>>(t);
    cudaDeviceSynchronize(); // flush the device-side printf buffer before exiting
    return 0;
}

In the healthy mode (where PyTorch can run and there are no artifacts in Mandelbrot), the messages are all of the form:

2 @ block???,thread??? with t=??? // "???" is an integer.

But after running some PyTorch epochs and getting into the ill mode, I ran this tiny program and got some 6s instead of 2s when t=5 in some threads:

2 @ block0,thread1 with t=5
2 @ block0,thread11 with t=5
6 @ block0,thread12 with t=5
6 @ block0,thread13 with t=5
6 @ block0,thread14 with t=5
6 @ block0,thread15 with t=5
2 @ block0,thread16 with t=5
2 @ block0,thread27 with t=5
6 @ block0,thread28 with t=5
6 @ block0,thread29 with t=5
6 @ block0,thread30 with t=5
6 @ block0,thread31 with t=5
2 @ block0,thread0 with t=6

Since I don’t have much knowledge about the hardware, I stopped here. Does somebody know what is happening here?