Memory access violations in kernel code when handling arrays of short integers

I originally posted this problem inside this thread about an implementation of [font=“Courier New”]atomicAdd [/font]for [font=“Courier New”]short[/font] integers, but thought the problem might be more of a general nature.

The problem I’m having currently is whereas the following test program works perfectly (credit goes to tera and Sylvain Collange for the [font=“Courier New”]atomicAddShort()[/font] function, see above link):

__global__ void test_kernel ( short* d_data)

{

        __shared__ short s_data[4];

        atomicAddShort(&d_data[0], 32767);

        atomicAddShort(&d_data[1], -32768);

        atomicAddShort(&d_data[2], 1);

        atomicAddShort(&d_data[3], -1);

s_data[0] = d_data[0];

        s_data[1] = d_data[1];

        s_data[2] = d_data[2];

        s_data[3] = d_data[3];

d_data[2] = s_data[2]-1;

        d_data[3] = s_data[3]+1;

        d_data[0] = s_data[0]+1;

        d_data[1] = s_data[1]-1;

}

//tera's signed version

__device__ short atomicAddShort(short* address, short val)

{

    unsigned int *base_address = (unsigned int *)((size_t)address & ~2);

    unsigned int long_val = ((size_t)address & 2) ? ((unsigned int)val << 16) : (unsigned short)val;

unsigned int long_old = atomicAdd(base_address, long_val);

    if((size_t)address & 2) {

        return (short)(long_old >> 16);

    } else {

        unsigned int overflow = ((long_old & 0xffff) + long_val) & 0xffff0000;

        if (overflow)

            atomicSub(base_address, overflow);

        return (short)(long_old & 0xffff);

    }

}

int main (void)

{

        short h_data[4] = {0};

        short* d_data;

cudaMalloc((void**)&d_data, 4*sizeof( short));

cudaMemcpy(d_data, h_data, 4*sizeof( short), cudaMemcpyHostToDevice);

test_kernel<<<1,1>>>(d_data);

cudaMemcpy(h_data, d_data, 4*sizeof( short), cudaMemcpyDeviceToHost);

printf("%d\n%d\n%d\n%d\n", h_data[0], h_data[1], h_data[2], h_data[3]);

return 0;

}

It does not work when the [font=“Courier New”]atomicAddShort()[/font] function is applied in the following way (ie. to operate on the array values stored in shared mem):

__global__ void test_kernel ( short* d_data)

{

	__shared__ short s_data[4];

	s_data[0] = d_data[0];

	s_data[1] = d_data[1];

	s_data[2] = d_data[2];

	s_data[3] = d_data[3];

	atomicAddShort(&s_data[0], 32767);

	atomicAddShort(&s_data[1], -32768);

	atomicAddShort(&s_data[2], 1);

	atomicAddShort(&s_data[3], -1);

	d_data[2] = s_data[2]-1;

	d_data[3] = s_data[3]+1;

	d_data[0] = s_data[0]+1;

	d_data[1] = s_data[1]-1;

}

Would anyone have any insight as to why this may be?

A memory alignment issue?

Cheers,

Mike

What compute capability is your device? Shared memory atomics require 1.2 or above.

It’s 1.3 and being compiles with options [font=“Courier New”]compute_13,sm_13[/font]

What happens if you replace the atomicAdd()s inside the atomicAddShort() with simple “+”? Of course the results will be wrong - but does it crash?

I think I know what’s happening here - after the cast to size_t and back to unsigned int* the compiler probably loses track of the fact that it’s a shared memory address.
Let me try to think of a solution…

It’s suboptimal, but it’s the only solution I could come up with quickly.

unsigned int *base_address = (unsigned int *) ((size_t)address & 2 ? address-1 : address);

And it’s not tested either…

tera, you are a legend!

this seems to have fixed it.

can you briefly explain what just changed?

and when you say it’s suboptimal, is this just because it introduces one extra block of divergence?

I said “suboptimal” because that line probably compiles to 4 instructions (and, compare, subtract, select value), while technically a single one would do. Then again, the condition code from the first two instructions hopefully gets reused, so we added only 2 instructions.

If the compiler doesn’t reuse the condition code, this casting orgie should also compile to 2 instructions:

unsigned int *base_address = (unsigned int *) ((char *)address - ((size_t)address & 2));