Intermittent crashing on GTX 1080 TI - Graphics SM Warp Exception on (GPC 2, TPC 0): Out Of Range Address

I’m having problems with intermittent crashing while training machine learning models on a dual-1080TI machine.

I’ve reproduced the same crash using both TensorFlow and PyTorch.

TensorFlow raises the following Python exception:
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed

And running a PyTorch benchmark utility I found online produces the following output:

THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=220 error=700 : an illegal memory access was encountered
target = torch.LongTensor(args.BATCH_SIZE).random_(args.NUM_CLASSES).cuda()
Traceback (most recent call last):
  File "benchmark_models.py", line 148, in <module>
    train_result=train(precision)
  File "benchmark_models.py", line 80, in train
    torch.cuda.synchronize()
  File "/home/matt/tf_env/lib/python3.6/site-packages/torch/cuda/__init__.py", line 398, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/torch/csrc/cuda/Module.cpp:220
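For context, the failing step in the benchmark boils down to something like the sketch below (the model, sizes, and optimizer here are my own stand-ins, not the script's actual values). Since CUDA reports asynchronous kernel faults at the next synchronization point, the error surfaces at the synchronize() call rather than at the kernel that actually caused it.

import torch
import torch.nn as nn

# Assumed values; the real script reads these from command-line args.
BATCH_SIZE = 16
NUM_CLASSES = 1000

model = nn.Linear(2048, NUM_CLASSES).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step, mirroring the traceback above.
data = torch.randn(BATCH_SIZE, 2048).cuda()
target = torch.LongTensor(BATCH_SIZE).random_(NUM_CLASSES).cuda()

optimizer.zero_grad()
loss = criterion(model(data), target)
loss.backward()
optimizer.step()

# Asynchronous CUDA errors are only reported at a sync point,
# which is why the RuntimeError points here.
torch.cuda.synchronize()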

Looking at the syslog at the time of the crash, both frameworks produce some variation of the following pattern (this is from the TensorFlow run):

Oct  2 09:00:46 barracuda kernel: [53909.532130] NVRM: GPU at PCI:0000:03:00: GPU-fe2c5425-28f7-dca4-b34b-b6a8796147e5
Oct  2 09:00:46 barracuda kernel: [53909.532132] NVRM: GPU Board Serial Number:
Oct  2 09:00:46 barracuda kernel: [53909.532134] NVRM: Xid (PCI:0000:03:00): 13, pid=1062, Graphics SM Warp Exception on (GPC 2, TPC 0): Out Of Range Address
Oct  2 09:00:46 barracuda kernel: [53909.532142] NVRM: Xid (PCI:0000:03:00): 13, pid=1062, Graphics Exception: ESR 0x514648=0x100000e 0x514650=0x20 0x514644=0xd3eff2 0x51464c=0x17f
Oct  2 09:00:46 barracuda kernel: [53909.535480] NVRM: Xid (PCI:0000:03:00): 43, pid=18219, Ch 00000010

Looking up the TensorFlow error, most people fix it either by setting the GPU options to allow memory growth (sketched below) or by reinstalling libhdf5-dev and python-h5py, but I can trigger the error while using less than 10% of VRAM, and my HDF5 packages are up to date.
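For reference, this is roughly what the commonly suggested memory-growth workaround looks like (TF 1.x Session API; a sketch, not my exact setup):

import tensorflow as tf

# Let TensorFlow allocate GPU memory on demand instead of
# reserving nearly all of it up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)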

Is there any way of confirming whether this is a hardware issue or not?

nvidia-bug-report.log (218.8 KB)
