Certain very intensive CUDA workloads cause a soft driver crash (i.e. the CUDA application crashes, but it can be restarted afterwards).
The dmesg/kernel log shows:
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address
NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50c648=0xe 0x50c650=0x20 0x50c644=0xd3eff2 0x50c64c=0x17f
NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
while the application reports this:
Unspecified launch failure
It would be nice if NVIDIA devs could look into this issue and resolve it. It affects non-overclocked Pascal and Maxwell v2 GPUs.
In my case I have NVIDIA driver 375.20 and a GTX 1060 6GB running at:
Core: 1922MHz
Memory: 7998MHz
Temperature: 53C
GPU usage: 99%
GPU power: 108W
All parameters are within the designated specs.
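For anyone who wants to capture the same readings on their own card while reproducing the crash, here is a minimal Python sketch that asks `nvidia-smi` for them. The names `FIELDS`, `parse_row` and `query_gpu` are made up for this example, and I am assuming the listed query fields are all reported by your card and driver; unsupported fields come back as text and are stored as `None`.

```python
import subprocess

# Query fields passed to nvidia-smi (core clock, memory clock, temperature,
# utilization, power draw). Assumption: your driver reports all of them.
FIELDS = ["clocks.sm", "clocks.mem", "temperature.gpu",
          "utilization.gpu", "power.draw"]


def parse_row(row):
    """Turn one CSV row from nvidia-smi into a {field: float or None} dict."""
    result = {}
    for name, raw in zip(FIELDS, row.split(",")):
        raw = raw.strip()
        try:
            result[name] = float(raw)
        except ValueError:
            # e.g. "[Not Supported]" or "N/A" on some cards
            result[name] = None
    return result


def query_gpu(index=0):
    """Run nvidia-smi once for the GPU with the given index and parse the row."""
    cmd = [
        "nvidia-smi",
        "--id=%d" % index,
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_row(out.splitlines()[0])
```

Calling `query_gpu()` every few seconds next to the CUDA workload and writing the result to a file gives a record of clocks, temperature and power right up to the moment the Xid errors appear. Note that the memory clock reported by `nvidia-smi` may not be on the same scale as the figure shown by other tools.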
Edit: this is getting ridiculous: my GPU crashes every 5 minutes.
NVRM: GPU Board Serial Number:
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address
NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50c648=0xe 0x50c650=0x20 0x50c644=0xd3eff2 0x50c64c=0x17f
NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 2): Out Of Range Address
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 1, TPC 2): Physical Multiple Warp Errors
NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50d648=0xe 0x50d650=0x24 0x50d644=0xd3eff2 0x50d64c=0x17f
NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3): Out Of Range Address
NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x505e48=0xe 0x505e50=0x20 0x505e44=0xd3eff2 0x505e4c=0x17f
NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0): Out Of Range Address
NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x504648=0xe 0x504650=0x20 0x504644=0xd3eff2 0x50464c=0x17f
NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
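To make logs like the above easier to compare between reports, here is a rough Python sketch that counts the Xid codes and the GPC/TPC units named in them. The regular expressions are only based on the lines pasted in this post, so other Xid message formats may not match, and `summarize_xids` is a name I made up for the example.

```python
import re
from collections import Counter

# Matches lines such as:
#   NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): ...
# Assumption: the format is the one seen in the log excerpts in this post.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<detail>.*)"
)
UNIT_RE = re.compile(r"\(GPC (?P<gpc>\d+), TPC (?P<tpc>\d+)\)")


def summarize_xids(log_text):
    """Return (Counter of Xid codes, Counter of (GPC, TPC) pairs) for a log."""
    codes = Counter()
    units = Counter()
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match is None:
            continue  # not an Xid line, e.g. the board serial number line
        codes[int(match.group("code"))] += 1
        unit = UNIT_RE.search(match.group("detail"))
        if unit is not None:
            units[(int(unit.group("gpc")), int(unit.group("tpc")))] += 1
    return codes, units


# The first three log lines from this post, used as sample input.
SAMPLE = """\
NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address
NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50c648=0xe 0x50c650=0x20 0x50c644=0xd3eff2 0x50c64c=0x17f
NVRM: Xid (PCI:0000:01:00): 43, Ch 00000020, engmask 00000101
"""

codes, units = summarize_xids(SAMPLE)
print(codes)
print(units)
```

Feeding it the full output of `dmesg` shows whether the exceptions stay on one GPC/TPC or move around the chip, as they do in my log.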
Also reported here:
https://github.com/nginnever/zogminer/issues/73
https://github.com/mbevand/silentarmy/issues/6
https://forums.geforce.com/default/topic/973629/official-375-70-game-ready-whql-display-driver-feedback-thread-released-10-28-16-/?offset=120
https://foldingforum.org/viewtopic.php?f=80&t=29276&start=135
https://forums.geforce.com/default/topic/979695/geforce-drivers/official-376-19-game-ready-whql-display-driver-feedback-thread-released-12-5-16-/18/
According to your Xid errors documentation, these two errors (Xid 13 and Xid 43) can indicate pretty much anything except a HW error.