Improper use of __ldg() causes illegal memory access

Okay, here’s a kernel I wrote:

struct tetrahedron
{
    int v[4];

    __host__ __device__
    tetrahedron(void)
    {
        v[0] = v[1] = v[2] = v[3] = -1;
    };

    __host__ __device__
    tetrahedron(const int a, 
                const int b, 
                const int c, 
                const int d)
    {
        v[0] = a;
        v[1] = b;
        v[2] = c;
        v[3] = d;
    };

    __host__ __device__
    void print(void) const
    {
        printf("%d, %d, %d, %d\n", v[0], v[1], v[2], v[3]);
    }
};

__device__ __host__
unsigned long reverse(unsigned x)
{
    const int prime_factor = 17;
    const int prime_offset = 3;
    unsigned long x_copy = x * prime_factor + prime_offset;
    assert(x_copy <= UINT_MAX);
    x = x_copy;

    x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
    x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
    x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
    x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
    return((x >> 16) | (x << 16));
}

__global__
void hash_faces(const int                         num_tet,
                const int*           __restrict__ tetra_index,
                const tetrahedron*   __restrict__ mesh,
                      unsigned long* __restrict__ hf,
                      int*           __restrict__ ht,
                      int*           __restrict__ po)
{
    const int thread_id = threadIdx.x + blockIdx.x * blockDim.x;
    const int grid_size = blockDim.x * gridDim.x;

    for (int tid = thread_id; tid < num_tet; tid += grid_size)
    {
        // load in tetrahedron
        const int tet_id    = __ldg(tetra_index + tid);
        const tetrahedron t = mesh[tet_id];

        // load in vertices
        /*const int v0 = __ldg(&t.v[0]);
        const int v1 = __ldg(&t.v[1]);
        const int v2 = __ldg(&t.v[2]);
        const int v3 = __ldg(&t.v[3]);*/

        const int v0 = t.v[0];
        const int v1 = t.v[1];
        const int v2 = t.v[2];
        const int v3 = t.v[3];

        // hash faces
        unsigned long hash[4] = { 0 };

        hash[0] = reverse(v3) + reverse(v2) + reverse(v1);
        hash[1] = reverse(v0) + reverse(v2) + reverse(v3);
        hash[2] = reverse(v0) + reverse(v3) + reverse(v1);
        hash[3] = reverse(v0) + reverse(v1) + reverse(v2);

        // write back
        const int offset = 4 * tid; // 4 faces per tetrahedron

        // hashed faces
        hf[offset + 0] = hash[0];
        hf[offset + 1] = hash[1];
        hf[offset + 2] = hash[2];
        hf[offset + 3] = hash[3];

        // index of hashed tetrahedra
        ht[offset + 0] =
        ht[offset + 1] =
        ht[offset + 2] =
        ht[offset + 3] = tet_id;

        // index of points opposite to each hashed face
        po[offset + 0] = 0;
        po[offset + 1] = 1;
        po[offset + 2] = 2;
        po[offset + 3] = 3;
    }
}

The mesh accesses are correct; I did a host-side print and it worked fine. The writes also look correct.

Basically, my code only crashes if I load v0 through v3 with __ldg() instead of just a normal dereference. Am I doing this wrong? The compiler didn’t give me any syntax errors, so I’m imagining that I’m using __ldg() as it was intended.

Or am I overloading my cache, which subsequently causes lines to be tossed out?

What exactly does “my code crashes” mean? What error status is reported?

When you read through the texture path, it is mandatory that the data being read is not being written to for the entire duration of the kernel. This is because the texture cache is non-coherent. Use of the __ldg() intrinsic specifically instructs the hardware to read through the texture path, so the programmer is responsible for making sure that the property “data is read-only for the entire duration of the kernel” holds true.

If data in global memory is modified and subsequently accessed through the texture path during the same kernel launch, the data read could be either the old or the new data, this is indeterminate. That could certainly cause “badness” in your program.

I’m not exactly sure why, but even this will cause an error in my kernel:

// load in vertices
const int v0 = __ldg(&t.v[0]);
const int v1 = __ldg(&t.v[1]);
const int v2 = __ldg(&t.v[2]);
const int v3 = __ldg(&t.v[3]);

//const int v0 = t.v[0];
//const int v1 = t.v[1];
//const int v2 = t.v[2];
//const int v3 = t.v[3];

printf("%d, %d, %d, %d\n", v0, v1, v2, v3);

This is the exact error:

terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
  what():  std::bad_alloc: an illegal memory access was encountered
Aborted (core dumped)

The kernel is called from a function like this:

thrust::device_vector<unsigned long> hf(4 * num_tetra, -1);
thrust::device_vector<int> ht(hf.size(), -1),
                           po(hf.size(), -1);

thrust::device_vector<int> tetra_index(num_tetra, -1);
thrust::sequence(tetra_index.begin(), tetra_index.end());

hash_faces<<<bpg, tpb>>>
          (num_tetra,
           thrust::raw_pointer_cast(tetra_index.data()),
           mesh.get(),
           thrust::raw_pointer_cast(hf.data()),
           thrust::raw_pointer_cast(ht.data()),
           thrust::raw_pointer_cast(po.data()));

cudaDeviceSynchronize();

@mutantjohn, why are you treating the ints v0-v3 as pointers after you already loaded ‘t’ and its 4 elements on line 60?

If you want to use LDG to load the 16-byte tetrahedron struct on line 60, then I would either make the struct a union with the int4 vector type as a member, or take a more crufty approach and just cast mesh to an int4:

__ldg((int4*)mesh + tet_id)

I can’t tell how the error message from Thrust relates to your use of __ldg(). Looks like it is time to break out the debugger and see where the bad addresses creep in that ultimately cause the out-of-bounds access.

t is a local variable.

__ldg is intended to be used for global-space loads.

Programming Guide :: CUDA Toolkit Documentation

(just adding to what allanmac said - your load sequence doesn’t make sense to me. You could use __ldg with a mesh-based pointer, as suggested.)

Won’t I have to do this for every index I want to extract from the structure though?

Accessing vertices 0 through 3 would then become:

const int v0 = __ldg((int*) mesh + 4 * tet_id + 0); // note the 4 * tet_id stride:
const int v1 = __ldg((int*) mesh + 4 * tet_id + 1); // each tetrahedron is four ints
const int v2 = __ldg((int*) mesh + 4 * tet_id + 2);
const int v3 = __ldg((int*) mesh + 4 * tet_id + 3);

correct?

Edit: The above does work, though. That’s a neat trick. That’s a really neat trick.

You don’t have to do it that way.

LDG(T*) supports all the standard scalar and vector types ranging from 1 to 16 bytes.

I tend to think of all loads/stores in terms of the width, coalescibility, and transaction size (32 bytes, 64 bytes, 128 bytes, etc.).

In your case, sizeof(tetrahedron) appears to be 16, so you can efficiently load it in one instruction. The tetrahedron instance will be held in four 32-bit registers in each thread.

I’m dumb!

Not gonna lie, when you said int4, I thought it was a CUDA primitive where each int is 4 bytes when, in reality, it’s four ints, and that’s why it’s 16 bytes! Oh snap. This changes the game. This changes everything!

taps madly on keyboard

Edit:

Okay, is this a better access pattern, or should I just flat-out stop making the local thread copies?

// load in tetrahedron
const int tet_id = __ldg(tetra_index + tid);
const int4 t     = __ldg((int4*) mesh + tet_id); // this is now in cache, right?

// load in vertices
// and these would live in the registers, correct?
const int v0 = ((int*) &t)[0];
const int v1 = ((int*) &t)[1];
const int v2 = ((int*) &t)[2];
const int v3 = ((int*) &t)[3];

Would it be best to just use the variables as I need them by calling :

((int*) &t)[i]

?

I only ask because the __ldg() documentation says it returns T and not T*, which is why my value extraction looks a bit funny.

I wouldn’t copy the fields. Once you’ve loaded ‘t’ it’s already sitting in 4 registers in the thread. Just access it as you would a “by value” struct on any other platform.

Also, it’s more compiler-friendly to access struct members by name (a, b, c, d) than to index into an array:

struct tetra { int a; int b; int c; int d; };

Grokking vector loads and stores takes some time, but the vector types are really useful.

Another thing to consider when loading structs of any size into a thread is that the CUDA compiler is really smart. For example, if you’re uniformly loading an 8-float structure and not using the last two floats anywhere in the kernel, the compiler will split the load into a float4 load followed by a float2 load.