I’m having problems with intermittent crashes while training machine learning models on a dual GTX 1080 Ti machine.
I’ve reproduced the same crash using both TensorFlow and PyTorch.
TensorFlow raises the following Python exception:
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed
Running a PyTorch benchmark utility I found produces the following output:
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=220 error=700 : an illegal memory access was encountered
target = torch.LongTensor(args.BATCH_SIZE).random_(args.NUM_CLASSES).cuda()
Traceback (most recent call last):
  File "benchmark_models.py", line 148, in <module>
    train_result=train(precision)
  File "benchmark_models.py", line 80, in train
    torch.cuda.synchronize()
  File "/home/matt/tf_env/lib/python3.6/site-packages/torch/cuda/__init__.py", line 398, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/torch/csrc/cuda/Module.cpp:220
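
For reference, the failing pattern boils down to something like this stripped-down sketch (BATCH_SIZE / NUM_CLASSES and the matrix sizes are placeholder values I picked, not the benchmark's actual ones):

import torch

BATCH_SIZE = 64       # placeholder, not the benchmark's value
NUM_CLASSES = 1000    # placeholder

# Same kind of allocation the benchmark does (matches the line in the output above)
target = torch.LongTensor(BATCH_SIZE).random_(NUM_CLASSES).cuda()

# Some GPU work followed by an explicit sync; the illegal memory access
# is only reported once synchronize() is called.
x = torch.randn(BATCH_SIZE, 1024, device="cuda")
w = torch.randn(1024, NUM_CLASSES, device="cuda")
for _ in range(100):
    y = x @ w
torch.cuda.synchronize()
print("synchronize() completed without error")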
Looking at the syslog at the time of the crash, both frameworks produce some variation of the following pattern (this is from the TensorFlow run):
Oct 2 09:00:46 barracuda kernel: [53909.532130] NVRM: GPU at PCI:0000:03:00: GPU-fe2c5425-28f7-dca4-b34b-b6a8796147e5
Oct 2 09:00:46 barracuda kernel: [53909.532132] NVRM: GPU Board Serial Number:
Oct 2 09:00:46 barracuda kernel: [53909.532134] NVRM: Xid (PCI:0000:03:00): 13, pid=1062, Graphics SM Warp Exception on (GPC 2, TPC 0): Out Of Range Address
Oct 2 09:00:46 barracuda kernel: [53909.532142] NVRM: Xid (PCI:0000:03:00): 13, pid=1062, Graphics Exception: ESR 0x514648=0x100000e 0x514650=0x20 0x514644=0xd3eff2 0x51464c=0x17f
Oct 2 09:00:46 barracuda kernel: [53909.535480] NVRM: Xid (PCI:0000:03:00): 43, pid=18219, Ch 00000010
Looking up the TensorFlow error, most people fix it either by setting the GPU options to allow memory growth or by reinstalling libhdf5-dev and python-h5py, but I can trigger the error while using less than 10% of VRAM, and h5py is up to date.
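
For reference, this is the memory-growth workaround they mean, as far as I understand it (TF 1.x style, matching the tf_env above):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # allocate VRAM on demand rather than all up front
sess = tf.Session(config=config)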
Is there any way of confirming whether this is a hardware issue or not?
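
In case it's useful, this is a rough per-GPU stress sketch (my own, not part of the benchmark) that I can run pinned to each card in turn, e.g. with CUDA_VISIBLE_DEVICES=0 and then CUDA_VISIBLE_DEVICES=1, to see whether only one of them throws the Xid errors:

import torch

def stress(iters=5000, size=4096):
    a = torch.randn(size, size, device="cuda")
    b = torch.randn(size, size, device="cuda")
    for i in range(iters):
        c = a @ b                      # keep the GPU busy with large matmuls
        if i % 500 == 0:
            torch.cuda.synchronize()   # any pending CUDA error surfaces here
            print(f"iteration {i} OK, |c| = {c.norm().item():.3e}")

if __name__ == "__main__":
    stress()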
nvidia-bug-report.log (218.8 KB)