Improper use of __ldg() causes illegal memory access

Okay, here’s a kernel I wrote:

struct tetrahedron
{
    int v[4];

    __host__ __device__
    tetrahedron(void)
    {
        v[0] = v[1] = v[2] = v[3] = -1;
    };

    __host__ __device__
    tetrahedron(const int a, 
                const int b, 
                const int c, 
                const int d)
    {
        v[0] = a;
        v[1] = b;
        v[2] = c;
        v[3] = d;
    };

    __host__ __device__
    void print(void) const
    {
        printf("%d, %d, %d, %d\n", v[0], v[1], v[2], v[3]);
    }
};

__device__ __host__
unsigned long reverse(unsigned x)
{
    const int prime_factor = 17;
    const int prime_offset = 3;
    unsigned long x_copy = x * prime_factor + prime_offset;
    assert(x_copy <= UINT_MAX);
    x = x_copy;

    x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
    x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
    x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
    x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
    return((x >> 16) | (x << 16));
}

__global__
void hash_faces(const int                         num_tet,
                const int*           __restrict__ tetra_index,
                const tetrahedron*   __restrict__ mesh,
                      unsigned long* __restrict__ hf,
                      int*           __restrict__ ht,
                      int*           __restrict__ po)
{
    const int thread_id = threadIdx.x + blockIdx.x * blockDim.x;
    const int grid_size = blockDim.x * gridDim.x;

    for (int tid = thread_id; tid < num_tet; tid += grid_size)
    {
        // load in tetrahedron
        const int tet_id    = __ldg(tetra_index + tid);
        const tetrahedron t = mesh[tet_id];

        // load in vertices
        /*const int v0 = __ldg(&t.v[0]);
        const int v1 = __ldg(&t.v[1]);
        const int v2 = __ldg(&t.v[2]);
        const int v3 = __ldg(&t.v[3]);*/

        const int v0 = t.v[0];
        const int v1 = t.v[1];
        const int v2 = t.v[2];
        const int v3 = t.v[3];

        // hash faces
        unsigned long hash[4] = { 0 };

        hash[0] = reverse(v3) + reverse(v2) + reverse(v1);
        hash[1] = reverse(v0) + reverse(v2) + reverse(v3);
        hash[2] = reverse(v0) + reverse(v3) + reverse(v1);
        hash[3] = reverse(v0) + reverse(v1) + reverse(v2);

        // write back
        const int offset = 4 * tid; // 4 faces per tetrahedron

        // hashed faces
        hf[offset + 0] = hash[0];
        hf[offset + 1] = hash[1];
        hf[offset + 2] = hash[2];
        hf[offset + 3] = hash[3];

        // index of hashed tetrahedra
        ht[offset + 0] =
        ht[offset + 1] =
        ht[offset + 2] =
        ht[offset + 3] = tet_id;

        // index of points opposite to each hashed face
        po[offset + 0] = 0;
        po[offset + 1] = 1;
        po[offset + 2] = 2;
        po[offset + 3] = 3;
    }
}

The mesh accesses are correct; I did a host-side print and it worked fine. The writes also look correct.

Basically, my code only crashes if I load v0 through v3 with __ldg() instead of just a normal dereference. Am I doing this wrong? The compiler didn’t give me any syntax errors, so I’m imagining that I’m using __ldg() as it was intended.

Or am I overloading my cache, which subsequently causes lines to be tossed out?

What exactly does “my code crashes” mean? What error status is reported?

When you read through the texture path, it is mandatory that the data being read is not being written to for the entire duration of the kernel. This is because the texture cache is non-coherent. Use of the __ldg() intrinsic specifically instructs the hardware to read through the texture path, so the programmer is responsible for making sure that the property “data is read-only for the entire duration of the kernel” holds true.

If data in global memory is modified and subsequently accessed through the texture path during the same kernel launch, the data read could be either the old or the new data, this is indeterminate. That could certainly cause “badness” in your program.

I’m not exactly sure why, but even this will cause an error in my kernel:

// load in vertices
const int v0 = __ldg(&t.v[0]);
const int v1 = __ldg(&t.v[1]);
const int v2 = __ldg(&t.v[2]);
const int v3 = __ldg(&t.v[3]);

//const int v0 = t.v[0];
//const int v1 = t.v[1];
//const int v2 = t.v[2];
//const int v3 = t.v[3];

printf("%d, %d, %d, %d\n", v0, v1, v2, v3);

This is the exact error:

terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
  what():  std::bad_alloc: an illegal memory access was encountered
Aborted (core dumped)

The kernel is called from a function like this:

thrust::device_vector<unsigned long> hf(4 * num_tetra, -1);
thrust::device_vector<int> ht(hf.size(), -1),
                           po(hf.size(), -1);

thrust::device_vector<int> tetra_index(num_tetra, -1);
thrust::sequence(tetra_index.begin(), tetra_index.end());

hash_faces<<<bpg, tpb>>>
          (num_tetra,
           thrust::raw_pointer_cast(tetra_index.data()),
           mesh.get(),
           thrust::raw_pointer_cast(hf.data()),
           thrust::raw_pointer_cast(ht.data()),
           thrust::raw_pointer_cast(po.data()));

cudaDeviceSynchronize();

@mutantjohn, why are you treating the ints v0-v3 as pointers after you already loaded ‘t’ and its 4 elements on line 60?

If you want to use LDG to load the 16-byte tetrahedron struct on line 60, then I would either make the struct a union with the int4 vector type as a member, or take a more crufty approach and just cast mesh to an int4:

__ldg((int4*)mesh + tet_id)

I can’t tell how the error message from Thrust relates to your use of __ldg(). Looks like it is time to break out the debugger and see where the bad addresses creep in that ultimately cause the out-of-bounds access.

t is a local variable.

__ldg is intended to be used for global-space loads.

Programming Guide :: CUDA Toolkit Documentation

(just adding to what allanmac said - your load sequence doesn’t make sense to me. You could use __ldg with a mesh-based pointer, as suggested.)

Won’t I have to do this for every index I want to extract from the structure though?

Accessing vertices 0 through 3 would then become:

const int v0 = __ldg((int*) mesh + 4 * tet_id + 0); // note the 4 * tet_id stride:
const int v1 = __ldg((int*) mesh + 4 * tet_id + 1); // each tetrahedron is four ints
const int v2 = __ldg((int*) mesh + 4 * tet_id + 2);
const int v3 = __ldg((int*) mesh + 4 * tet_id + 3);

correct?

Edit: The above does work, though. That’s a neat trick. That’s a really neat trick.

You don’t have to do it that way.

LDG(T*) supports all the standard scalar and vector types ranging from 1 to 16 bytes.

I tend to think of all loads/stores in terms of the width, coalescibility, and transaction size (32 bytes, 64 bytes, 128 bytes, etc.).

In your case, sizeof(tetrahedron) appears to be 16, so you can efficiently load it in one instruction. The tetrahedron instance will be held in four 32-bit registers in each thread.

I’m dumb!

Not gonna lie, when you said int4, I thought it was a CUDA primitive where each int is 4 bytes when, in reality, it’s four ints, and that’s why it’s 16 bytes! Oh snap. This changes the game. This changes everything!

taps madly on keyboard

Edit:

Okay, is this a better access pattern, or should I just flat-out stop making the local thread copies?

// load in tetrahedron
const int tet_id = __ldg(tetra_index + tid);
const int4 t     = __ldg((int4*) mesh + tet_id); // this is now in cache, right?

// load in vertices
// and these would live in the registers, correct?
const int v0 = ((int*) &t)[0];
const int v1 = ((int*) &t)[1];
const int v2 = ((int*) &t)[2];
const int v3 = ((int*) &t)[3];

Would it be best to just use the variables as I need them by calling :

((int*) &t)[i]

?

I only ask because the __ldg() documentation says it returns T and not T*, which is why my value extraction looks a bit funny.

I wouldn’t copy the fields. Once you’ve loaded ‘t’ it’s already sitting in 4 registers in the thread. Just access it as you would a “by value” struct on any other platform.

Also, it’s more compiler-friendly to access struct members by name (a, b, c, d) than to index into an array:

struct tetra { int a; int b; int c; int d; };

Grokking vector loads and stores takes some time, but the vector types are really useful.

Another thing to consider when loading structs of any size into a thread is that the CUDA compiler is really smart. For example, if you’re uniformly loading an 8-float structure and not using the last two floats anywhere in the kernel, the compiler will split the load into a float4 load followed by a float2 load.