I’m building something that’s conceptually a kind of database using Thrust. I allocate some long-lived device_vector’s on heap using the new operator:
auto v = new thrust::device_vector<int>(startPointer, startPointer + size);
Some time later I perform some operations on that vector:
void filter_device_vector_in_place(thrust::device_vector<int>* d_filter_values) {
auto result = thrust::remove_if(thrust::device, d_filter_values->begin(), d_filter_values->end(), is_even());
d_filter_values->erase(result, d_filter_values->end());
d_filter_values->shrink_to_fit();
}
Very occasionally this will cause an exception:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): copy_if failed to synchronize: an illegal memory access was encountered
with the following logs in dmesg:
[16498.830496] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0): Out Of Range Address
[16498.830503] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x504648=0x5000e 0x504650=0x20 0x504644=0xd3eff2 0x50464c=0x17f
[16498.830547] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1): Out Of Range Address
[16498.830551] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x504e48=0x5000e 0x504e50=0x20 0x504e44=0xd3eff2 0x504e4c=0x17f
[16498.830591] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 2): Out Of Range Address
[16498.830594] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x505648=0xe 0x505650=0x20 0x505644=0xd3eff2 0x50564c=0x17f
[16498.830633] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3): Out Of Range Address
[16498.830637] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x505e48=0xa000e 0x505e50=0x20 0x505e44=0xd3eff2 0x505e4c=0x17f
[16498.830676] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 4): Out Of Range Address
[16498.830680] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x506648=0xe 0x506650=0x20 0x506644=0xd3eff2 0x50664c=0x17f
[16498.830724] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address
[16498.830728] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50c648=0xe 0x50c650=0x20 0x50c644=0xd3eff2 0x50c64c=0x17f
[16498.830769] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1): Out Of Range Address
[16498.830773] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50ce48=0x5000e 0x50ce50=0x20 0x50ce44=0xd3eff2 0x50ce4c=0x17f
[16498.830805] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 2): Out Of Range Address
[16498.830809] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50d648=0xe 0x50d650=0x20 0x50d644=0xd3eff2 0x50d64c=0x17f
[16498.830841] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 3): Out Of Range Address
[16498.830844] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50de48=0x4000e 0x50de50=0x20 0x50de44=0xd3eff2 0x50de4c=0x17f
[16498.830873] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 4): Out Of Range Address
[16498.830877] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50e648=0xe 0x50e650=0x20 0x50e644=0xd3eff2 0x50e64c=0x17f
[16498.831265] NVRM: Xid (PCI:0000:03:00): 43, Ch 00000050, engmask 00000101
This only happens on the very first operation after the app is started - and only every dozen starts or so. If the first execution succeeds then the next thousand also succeed. I’m at a loss as to what’s causing this. Any ideas?
Edit: some more creative googling seems to indicate that this is a driver error (eg. https://devtalk.nvidia.com/default/topic/987301/linux/error-graphics-sm-warp-exception-on-gpc-1-tpc-0-out-of-range-address-xid-13-xid-43-/). I can only trigger it if I restart my app quickly in succession. Does that sound plausible?