Error: Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address (Xid 13/Xid 43)

birdie · January 10, 2017, 5:45pm

Certain very intensive CUDA workloads make the driver soft-crash (i.e. the CUDA application crashes, but it can be restarted):

dmesg/kernel log:

NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address
NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50c648=0xe 0x50c650=0x20 0x50c644=0xd3eff2 0x50c64c=0x17f
NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
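
For anyone collecting similar reports, here is a minimal Python sketch that pulls the PCI address, Xid code and detail text out of NVRM lines like the ones above. The regular expression and the `parse_xid` name are assumptions inferred from the line format quoted in this thread, not an NVIDIA-documented format.

```python
import re

# Pattern inferred from the log lines quoted in this thread (an assumption).
# search() is used so that an optional "[ 1234.567]" dmesg timestamp in front
# of "NVRM:" is simply skipped.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<detail>.*)"
)


def parse_xid(line):
    """Return (pci_address, xid_code, detail), or None if the line is not an Xid report."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return m.group("pci"), int(m.group("code")), m.group("detail").strip()


if __name__ == "__main__":
    sample = [
        "NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address",
        "NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101",
    ]
    for line in sample:
        print(parse_xid(line))
```

Feeding it the output of `dmesg` line by line gives a quick list of which Xid codes occurred and in what order.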

while the application reports this:

Unspecified launch failure

It would be nice if NVIDIA devs could look into this issue and resolve it. It affects non-overclocked Pascal and Maxwell v2 GPUs.

In my case I have NVIDIA drivers 375.20 and GTX 1060 6GB running at:

Core: 1922MHz
Memory: 7998MHz
Temperature: 53C
GPU usage: 99%
GPU power: 108W

All parameters are within the designated specs.

Edit: this is getting ridiculous: my GPU crashes every 5 minutes.

NVRM: GPU Board Serial Number:
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address
NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50c648=0xe 0x50c650=0x20 0x50c644=0xd3eff2 0x50c64c=0x17f
NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 2): Out Of Range Address
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 1, TPC 2): Physical Multiple Warp Errors
NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50d648=0xe 0x50d650=0x24 0x50d644=0xd3eff2 0x50d64c=0x17f
NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3): Out Of Range Address
NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x505e48=0xe 0x505e50=0x20 0x505e44=0xd3eff2 0x505e4c=0x17f
NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0): Out Of Range Address
NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x504648=0xe 0x504650=0x20 0x504644=0xd3eff2 0x50464c=0x17f
NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
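
The log above shows the exception moving between GPC/TPC units rather than staying on one. A rough Python sketch for tallying that from a kernel log follows; the pattern and the `count_sm_exceptions` name are assumptions based only on the lines quoted here, and the tally says nothing by itself about the root cause.

```python
import re
from collections import Counter

# Pattern inferred from the "Graphics SM Warp/Global Exception" lines quoted
# in this thread (an assumption, not an NVIDIA-documented format).
SM_RE = re.compile(
    r"Graphics SM (?:Warp|Global) Exception on \(GPC (\d+), TPC (\d+)\): (.+)"
)


def count_sm_exceptions(lines):
    """Count SM exceptions keyed by (gpc, tpc, message)."""
    counts = Counter()
    for line in lines:
        m = SM_RE.search(line)
        if m:
            key = (int(m.group(1)), int(m.group(2)), m.group(3).strip())
            counts[key] += 1
    return counts


if __name__ == "__main__":
    log = [
        "NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address",
        "NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 2): Out Of Range Address",
        "NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 1, TPC 2): Physical Multiple Warp Errors",
        "NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3): Out Of Range Address",
        "NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0): Out Of Range Address",
    ]
    for key, n in sorted(count_sm_exceptions(log).items()):
        print(key, n)
```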

Also reported here:

https://github.com/nginnever/zogminer/issues/73
https://github.com/mbevand/silentarmy/issues/6
https://forums.geforce.com/default/topic/973629/official-375-70-game-ready-whql-display-driver-feedback-thread-released-10-28-16-/?offset=120
https://foldingforum.org/viewtopic.php?f=80&t=29276&start=135
https://forums.geforce.com/default/topic/979695/geforce-drivers/official-376-19-game-ready-whql-display-driver-feedback-thread-released-12-5-16-/18/

According to your Xid errors documentation, these two errors can indicate almost anything except a hardware error.

birdie · March 17, 2017, 11:08pm

This is still reproducible with the latest drivers. Sigh.

RagnarRainMaker · December 5, 2017, 12:59am

I too am getting a very similar error, but in my case it has a bit of extra info at the bottom, indicating a possible issue with multi-threading. Are you also seeing the last line?

[ 7968.019355] NVRM: GPU at PCI:0000:03:00: GPU-0af6db73-f4fc-6fab-80e4-899a77ec8749
[ 7968.019367] NVRM: Xid (PCI:0000:03:00): 62, 16ca(17b4) 84000128 96399669 | mb4:ffffffff mb5:ffffffff mb6:ffffffff
[ 7976.007281] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Stack Error
[ 7976.007301] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50c648=0x1 0x50c650=0x0 0x50c644=0x0 0x50c64c=0x8000003b
[ 7976.007341] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50ce48=0x5dd31 0x50ce50=0x0 0x50ce44=0x0 0x50ce4c=0x3e
[ 7976.007375] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50d648=0x19f 0x50d650=0x0 0x50d644=0x18e368 0x50d64c=0x28
[ 7976.007399] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ChID 0008, Class 0000a197, Offset 00000000, Data 00000000
[ 7980.030756] NVRM: Xid (PCI:0000:03:00): 31, Ch 0000000a, engmask 00000111, intr 10000000
[ 7987.489575] NVRM: Xid (PCI:0000:03:00): 8, Channel 00000000
[ 7989.994006] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

I don’t know if this will help, but I’ve heard newer drivers don’t work with some cards. Personally, I’m running the GeForce GTX 780 Ti.

MuffinBoy · February 19, 2021, 1:20am

Hello there, old thread, but just to let you know I’m having the exact same issue running a Quadro P6000 (link: Coredump - Graphics SM Warp Exception on (GPC 0, TPC 0): Out Of Range Address - Quadro P6000)

Did you ever find a solution?

Thank you

birdie · April 14, 2021, 8:47pm

I’ve long stopped using that CUDA application, so I’ve no idea.

generix · April 14, 2021, 9:21pm

Might be the watchdog if you’re concurrently running an X server on the same GPU:
https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x

birdie · April 15, 2021, 8:46am

A nice find. It’s weird that this issue doesn’t affect Windows users, AFAIK, even though Windows integrates with the GPU a lot more than X.org / Linux / the NVIDIA Linux drivers do.

generix · April 15, 2021, 12:06pm

AFAIK, Windows should be affected in the same way. At least, I got the impression that this is the reason two driver models exist on Windows: the normal driver and the compute driver.
