Error: Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address (Xid 13/Xid 43)

Certain very intensive CUDA workloads make the driver soft-crash: the CUDA application crashes, but it can be restarted afterwards.

dmesg/kernel log:

NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address
NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50c648=0xe 0x50c650=0x20 0x50c644=0xd3eff2 0x50c64c=0x17f
NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101

while the application reports this:

Unspecified launch failure

It would be nice if NVIDIA devs could look into this issue and resolve it. It affects non-overclocked Pascal and Maxwell v2 GPUs.

In my case I have NVIDIA driver 375.20 and a GTX 1060 6GB running at:

Core: 1922MHz
Memory: 7998MHz
Temperature: 53C
GPU usage: 99%
GPU power: 108W

All parameters are within the designated specs.

Edit: this is getting ridiculous: my GPU crashes every 5 minutes.

NVRM: GPU Board Serial Number:
 NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address
 NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50c648=0xe 0x50c650=0x20 0x50c644=0xd3eff2 0x50c64c=0x17f
 NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
 NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 2): Out Of Range Address
 NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 1, TPC 2): Physical Multiple Warp Errors
 NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50d648=0xe 0x50d650=0x24 0x50d644=0xd3eff2 0x50d64c=0x17f
 NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
 NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3): Out Of Range Address
 NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x505e48=0xe 0x505e50=0x20 0x505e44=0xd3eff2 0x505e4c=0x17f
 NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
 NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0): Out Of Range Address
 NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x504648=0xe 0x504650=0x20 0x504644=0xd3eff2 0x50464c=0x17f
 NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
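
To see at a glance whether the exceptions cluster on one GPC/TPC or are spread across the chip, the log above can be summarized with a few lines of Python. This is only a rough sketch: the regular expressions are my own guess at the line format based on the excerpts in this post, and the names `summarize_xids` and `affected_units` are hypothetical helpers, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed line format, based only on the log excerpts in this thread.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+), (.*)")
SM_RE = re.compile(r"Graphics SM \w+ Exception on \(GPC (\d+), TPC (\d+)\): (.+)")

def summarize_xids(log_text):
    """Count Xid codes and SM exception locations in kernel log text."""
    xid_counts = Counter()
    sm_locations = Counter()
    for line in log_text.splitlines():
        m = XID_RE.search(line)
        if not m:
            continue  # not an Xid line
        xid_counts[int(m.group(2))] += 1
        sm = SM_RE.match(m.group(3))
        if sm:
            key = (int(sm.group(1)), int(sm.group(2)), sm.group(3).strip())
            sm_locations[key] += 1
    return xid_counts, sm_locations

def affected_units(sm_locations):
    """Return the sorted list of distinct (GPC, TPC) pairs that raised an exception."""
    return sorted({(gpc, tpc) for gpc, tpc, _ in sm_locations})

# A few lines copied from the log above.
SAMPLE = """\
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address
NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50c648=0xe 0x50c650=0x20 0x50c644=0xd3eff2 0x50c64c=0x17f
NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 2): Out Of Range Address
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 1, TPC 2): Physical Multiple Warp Errors
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3): Out Of Range Address
"""

if __name__ == "__main__":
    counts, locations = summarize_xids(SAMPLE)
    print("Xid counts:", dict(counts))
    print("Affected (GPC, TPC) units:", affected_units(locations))
```

Feeding it the full output of `dmesg` instead of `SAMPLE` should work the same way, as long as the lines keep the format shown above.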

Also reported here:

https://github.com/nginnever/zogminer/issues/73
https://github.com/mbevand/silentarmy/issues/6
https://forums.geforce.com/default/topic/973629/official-375-70-game-ready-whql-display-driver-feedback-thread-released-10-28-16-/?offset=120
https://foldingforum.org/viewtopic.php?f=80&t=29276&start=135
https://forums.geforce.com/default/topic/979695/geforce-drivers/official-376-19-game-ready-whql-display-driver-feedback-thread-released-12-5-16-/18/
https://bitcointalk.org/index.php?topic=1707546.640

According to your Xid errors documentation, these two errors might indicate almost anything except a HW error.


[Attached image: xid_13_43.png]

This is reproducible with the latest drivers. Sigh.

I too am getting a very similar error, but in my case it has a bit of extra info at the bottom, indicating a possible issue with multi-threading. Are you also seeing the last line?

[ 7968.019355] NVRM: GPU at PCI:0000:03:00: GPU-0af6db73-f4fc-6fab-80e4-899a77ec8749
[ 7968.019367] NVRM: Xid (PCI:0000:03:00): 62, 16ca(17b4) 84000128 96399669 | mb4:ffffffff mb5:ffffffff mb6:ffffffff
[ 7976.007281] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Stack Error
[ 7976.007301] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50c648=0x1 0x50c650=0x0 0x50c644=0x0 0x50c64c=0x8000003b
[ 7976.007341] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50ce48=0x5dd31 0x50ce50=0x0 0x50ce44=0x0 0x50ce4c=0x3e
[ 7976.007375] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50d648=0x19f 0x50d650=0x0 0x50d644=0x18e368 0x50d64c=0x28
[ 7976.007399] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ChID 0008, Class 0000a197, Offset 00000000, Data 00000000
[ 7980.030756] NVRM: Xid (PCI:0000:03:00): 31, Ch 0000000a, engmask 00000111, intr 10000000
[ 7987.489575] NVRM: Xid (PCI:0000:03:00): 8, Channel 00000000
[ 7989.994006] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
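
When comparing logs like this one, it can help to reduce them to the order in which the different Xid codes first appear and how far apart they are. The snippet below is a minimal sketch under the assumption that every line carries a `[ seconds ]` kernel timestamp as in the excerpt above; `xid_timeline` is a hypothetical helper name.

```python
import re

# Assumed format: "[ 7968.019367] NVRM: Xid (PCI:...): 62, ..."
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+NVRM: Xid \([^)]*\): (\d+),")

def xid_timeline(log_text):
    """Return [(seconds_since_first_xid, xid_code), ...].

    Consecutive lines with the same Xid code are collapsed into the first one.
    """
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            events.append((float(m.group(1)), int(m.group(2))))
    if not events:
        return []
    start = events[0][0]
    timeline = []
    for stamp, code in events:
        if timeline and timeline[-1][1] == code:
            continue  # same code as the previous entry, skip the repeat
        timeline.append((stamp - start, code))
    return timeline

# A few lines copied from the log above.
SAMPLE = """\
[ 7968.019367] NVRM: Xid (PCI:0000:03:00): 62, 16ca(17b4) 84000128 96399669 | mb4:ffffffff mb5:ffffffff mb6:ffffffff
[ 7976.007281] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Stack Error
[ 7976.007301] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50c648=0x1 0x50c650=0x0 0x50c644=0x0 0x50c64c=0x8000003b
[ 7980.030756] NVRM: Xid (PCI:0000:03:00): 31, Ch 0000000a, engmask 00000111, intr 10000000
[ 7987.489575] NVRM: Xid (PCI:0000:03:00): 8, Channel 00000000
[ 7989.994006] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
"""

if __name__ == "__main__":
    for offset, code in xid_timeline(SAMPLE):
        print(f"+{offset:7.3f}s  Xid {code}")
```

In this excerpt the Xid 62 is logged about eight seconds before the first Xid 13, and the `os_schedule` message is not an Xid line at all, so the sketch ignores it.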

I don’t know if this will help, but I’ve heard newer drivers don’t work with some cards. Personally, I’m running a GeForce GTX 780 Ti.

Hello there. This is an old thread, but just to let you know, I’m having the exact same issue on a Quadro P6000 (link: Use Unet Industrial NGC docker on Quadro P6000 - TensorFlow issue?)

Did you ever find a solution?

Thank you

I stopped using that CUDA application long ago, so I have no idea.

It might be the watchdog timer if you’re concurrently running an X server on that GPU:
https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x
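
One rough way to check whether a display is active on the GPU that runs the CUDA job is to ask `nvidia-smi` for its `display_active` and `display_mode` fields. The sketch below assumes `nvidia-smi` is on the PATH and reports these fields as `Enabled`/`Disabled`; the helper names and the sample output string are made up for illustration. A more direct check is the `kernelExecTimeoutEnabled` device property in the CUDA runtime, which the `deviceQuery` sample prints as "Run time limit on kernels".

```python
import csv
import io
import subprocess

QUERY_FIELDS = "index,name,display_active,display_mode"

def parse_display_flags(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output."""
    rows = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != 4:
            continue  # skip blank or unexpected lines
        index, name, active, mode = (field.strip() for field in row)
        rows.append({
            "index": int(index),
            "name": name,
            "display_active": active == "Enabled",
            "display_mode": mode == "Enabled",
        })
    return rows

def query_display_flags():
    """Run nvidia-smi and return the parsed flags (needs the NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return parse_display_flags(result.stdout)

# Hypothetical output for a two-GPU machine, for illustration only.
EXAMPLE_OUTPUT = (
    "0, GeForce GTX 1060 6GB, Enabled, Enabled\n"
    "1, Quadro P6000, Disabled, Disabled\n"
)

if __name__ == "__main__":
    for gpu in parse_display_flags(EXAMPLE_OUTPUT):
        hint = "display active, watchdog may apply" if gpu["display_active"] else "no active display"
        print(f"GPU {gpu['index']} ({gpu['name']}): {hint}")
```

On a real machine, call `query_display_flags()` instead of parsing `EXAMPLE_OUTPUT`. If the compute GPU shows an active display, moving X to another GPU is the usual way to rule the watchdog out.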

Nice find. It’s odd that, AFAIK, this issue doesn’t affect Windows users, even though Windows integrates with the GPU far more tightly than X.org and the NVIDIA Linux drivers do.

AFAIK, Windows should be affected in the same way. At least, I got the impression that’s the reason two driver models exist on Windows: the normal driver (WDDM) and the compute driver (TCC).