Intermittent crashing on GTX 1080 TI - Graphics SM Warp Exception on (GPC 2, TPC 0): Out Of Range Address

I’m having problems with intermittent crashing while training machine learning models on a dual-1080TI machine.

I’ve reproduced the same crash using both TensorFlow and PyTorch.

TensorFlow raises the following Python exception:
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed

And running a PyTorch benchmark utility I found online produces the following output:

THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=220 error=700 : an illegal memory access was encountered
target = torch.LongTensor(args.BATCH_SIZE).random_(args.NUM_CLASSES).cuda()
Traceback (most recent call last):
  File "benchmark_models.py", line 148, in <module>
    train_result=train(precision)
  File "benchmark_models.py", line 80, in train
    torch.cuda.synchronize()
  File "/home/matt/tf_env/lib/python3.6/site-packages/torch/cuda/__init__.py", line 398, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/torch/csrc/cuda/Module.cpp:220
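For context, the failing step in the benchmark boils down to something like the sketch below (the model, sizes, and optimizer here are my own stand-ins, not the script's actual values). Since CUDA reports asynchronous kernel faults at the next synchronization point, the error surfaces at the synchronize() call rather than at the kernel that actually caused it.

import torch
import torch.nn as nn

# Assumed values; the real script reads these from command-line args.
BATCH_SIZE = 16
NUM_CLASSES = 1000

model = nn.Linear(2048, NUM_CLASSES).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step, mirroring the traceback above.
data = torch.randn(BATCH_SIZE, 2048).cuda()
target = torch.LongTensor(BATCH_SIZE).random_(NUM_CLASSES).cuda()

optimizer.zero_grad()
loss = criterion(model(data), target)
loss.backward()
optimizer.step()

# Asynchronous CUDA errors are only reported at a sync point,
# which is why the RuntimeError points here.
torch.cuda.synchronize()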

Looking at the syslog at the time of the crash, both frameworks produce some variation of the following pattern (this is from the TensorFlow run):

Oct  2 09:00:46 barracuda kernel: [53909.532130] NVRM: GPU at PCI:0000:03:00: GPU-fe2c5425-28f7-dca4-b34b-b6a8796147e5
Oct  2 09:00:46 barracuda kernel: [53909.532132] NVRM: GPU Board Serial Number:
Oct  2 09:00:46 barracuda kernel: [53909.532134] NVRM: Xid (PCI:0000:03:00): 13, pid=1062, Graphics SM Warp Exception on (GPC 2, TPC 0): Out Of Range Address
Oct  2 09:00:46 barracuda kernel: [53909.532142] NVRM: Xid (PCI:0000:03:00): 13, pid=1062, Graphics Exception: ESR 0x514648=0x100000e 0x514650=0x20 0x514644=0xd3eff2 0x51464c=0x17f
Oct  2 09:00:46 barracuda kernel: [53909.535480] NVRM: Xid (PCI:0000:03:00): 43, pid=18219, Ch 00000010

Looking up the TensorFlow error, most people fix it either by setting the GPU options to allow memory growth (sketched below) or by reinstalling libhdf5-dev and python-h5py, but I can trigger the error while using less than 10% of VRAM, and my HDF5 packages are up to date.
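For reference, this is roughly what the commonly suggested memory-growth workaround looks like (TF 1.x Session API; a sketch, not my exact setup):

import tensorflow as tf

# Let TensorFlow allocate GPU memory on demand instead of
# reserving nearly all of it up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)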

Is there any way of confirming whether this is a hardware issue or not?

nvidia-bug-report.log (218.8 KB)
