Weird intermittent illegal memory access error

I’m building something that’s conceptually a kind of database using Thrust. I allocate some long-lived device_vector’s on heap using the new operator:

auto v = new thrust::device_vector<int>(startPointer, startPointer + size);

Some time later I perform some operations on that vector:

void filter_device_vector_in_place(thrust::device_vector<int>* d_filter_values) {

            auto result = thrust::remove_if(thrust::device, d_filter_values->begin(), d_filter_values->end(), is_even());

            d_filter_values->erase(result, d_filter_values->end());
            d_filter_values->shrink_to_fit();
        }

Very occasionally this will cause an exception:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  copy_if failed to synchronize: an illegal memory access was encountered

with the following logs in dmesg:

[16498.830496] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0): Out Of Range Address
[16498.830503] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x504648=0x5000e 0x504650=0x20 0x504644=0xd3eff2 0x50464c=0x17f
[16498.830547] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1): Out Of Range Address
[16498.830551] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x504e48=0x5000e 0x504e50=0x20 0x504e44=0xd3eff2 0x504e4c=0x17f
[16498.830591] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 2): Out Of Range Address
[16498.830594] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x505648=0xe 0x505650=0x20 0x505644=0xd3eff2 0x50564c=0x17f
[16498.830633] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3): Out Of Range Address
[16498.830637] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x505e48=0xa000e 0x505e50=0x20 0x505e44=0xd3eff2 0x505e4c=0x17f
[16498.830676] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 4): Out Of Range Address
[16498.830680] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x506648=0xe 0x506650=0x20 0x506644=0xd3eff2 0x50664c=0x17f
[16498.830724] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address
[16498.830728] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50c648=0xe 0x50c650=0x20 0x50c644=0xd3eff2 0x50c64c=0x17f
[16498.830769] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1): Out Of Range Address
[16498.830773] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50ce48=0x5000e 0x50ce50=0x20 0x50ce44=0xd3eff2 0x50ce4c=0x17f
[16498.830805] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 2): Out Of Range Address
[16498.830809] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50d648=0xe 0x50d650=0x20 0x50d644=0xd3eff2 0x50d64c=0x17f
[16498.830841] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 3): Out Of Range Address
[16498.830844] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50de48=0x4000e 0x50de50=0x20 0x50de44=0xd3eff2 0x50de4c=0x17f
[16498.830873] NVRM: Xid (PCI:0000:03:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 4): Out Of Range Address
[16498.830877] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x50e648=0xe 0x50e650=0x20 0x50e644=0xd3eff2 0x50e64c=0x17f
[16498.831265] NVRM: Xid (PCI:0000:03:00): 43, Ch 00000050, engmask 00000101

This only happens on the very first operation after the app is started - and only every dozen starts or so. If the first execution succeeds then the next thousand also succeed. I’m at a loss as to what’s causing this. Any ideas?

Edit: some more creative googling seems to indicate that this is a driver error (eg. https://devtalk.nvidia.com/default/topic/987301/linux/error-graphics-sm-warp-exception-on-gpc-1-tpc-0-out-of-range-address-xid-13-xid-43-/). I can only trigger it if I restart my app quickly in succession. Does that sound plausible?

I don’t see anything that looks like a driver error. Everything indicates the code is making an illegal access. That should be debugged. I can’t speculate as to the reasons for this with the limited information you’ve provided.

It’s fully expected that if your code makes an illegal access (which doesn’t mean CUDA or the driver is broken, it means your code is broken) that there might be some debug spew that shows up in dmesg. The dmesg spew you’ve extracted seems to correspond exactly with the runtime error indication you are getting.

Right, I thought the same. But I can’t find any illegal access. I’m literally just creating a device vector from a native array and then feeding it into a thrust::remove_if. I don’t see any scope for any illegal access. This is all single threaded as well, unless there’s some sort of async magic going on behind the scenes in Thrust.

Interestingly, starting and stopping a simple test app repeatedly in quick succession triggers the same error. Other CUDA apps then won’t start until 20 to 30 seconds have passed.

I’ll attempt to extract a test case.

Ok, so here’s a tiny minimal test case that reproduces the problem:

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/remove.h>

struct is_even
{
    __host__ __device__
    bool operator()(const int x)
    {
        return (x % 2) == 0;
    }
};

void filter_device_vector_in_place(thrust::device_vector<int>* d_filter_values) {
    auto result = thrust::remove_if(thrust::device, d_filter_values->begin(), d_filter_values->end(), is_even());

    d_filter_values->erase(result, d_filter_values->end());
    d_filter_values->shrink_to_fit();
}

int main() {
    int size = 1000000;
    int* values = (int *)malloc(sizeof(int)*size);

    for(int x = 0; x < size; x++) {
        values[x] = rand();
    }

    int *startPointer = values;
    auto v = new thrust::device_vector<int>(startPointer, startPointer + size);

    filter_device_vector_in_place(v);
}

Running this 10 times consecutively triggers that error three or four times in ten.

And an even more simplified example:

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/remove.h>

struct is_even
{
    __host__ __device__
    bool operator()(const int x)
    {
        return (x % 2) == 0;
    }
};

int main() {
    int size = 1000000;
    int* values = (int *)malloc(sizeof(int)*size);

    for(int x = 0; x < size; x++) {
        values[x] = rand();
    }

    auto d_filter_values = thrust::device_vector<int>(values, values + size);

    auto result = thrust::remove_if(thrust::device, d_filter_values.begin(), d_filter_values.end(), is_even());
}

This regularly but not always results in:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  copy_if failed to synchronize: an illegal memory access was encountered

Ok, I found the cause: this is being compiled as a shared library / executable I’d neglected to set the same compilation options in all modules. I added

set(CMAKE_CUDA_FLAGS "${CUDA_NVCC_FLAGS} ${CUDA_ARCH} --std=c++14 -arch sm_61")

to every module and it works reliably now.