Bug(?) with short datatype in shared memory

Hi all,

I’m seeing a curious behavior in one of my CUDA kernels. It seems that the behavior of shared memory depends on how it is declared for small (<4 byte) datatypes.

I have code that works on unsigned shorts (16 bits). To allow coalescing on 1.0/1.1 hardware, the final stage (an XOR of a global memory element with a locally produced value) is performed by half the threads on unsigned ints (32 bits). It’s obviously a lot faster. BUT, of the two supposedly equivalent ways of declaring the shared memory, one works and one doesn’t.

  • If I do this:
__shared__ unsigned int alignbuffer_int[N_THREADS/2];

unsigned short *alignbuffer = (unsigned short*)alignbuffer_int;

Then each thread stores its unsigned short element in alignbuffer[tid], and the first half of the threads XOR the unsigned int element alignbuffer_int[tid] with the corresponding global memory element: this works fine (a full sketch of this variant follows the second case below).

  • But if I do this, which I believe should produce the exact same results:
__shared__ unsigned short alignbuffer[N_THREADS];

unsigned int *alignbuffer_int = (unsigned int*)alignbuffer;

Then the results are all wrong. This was observed both with CUDA 1.1 on 1.0 hardware and with CUDA 2.0 on 1.3 hardware.
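For reference, here is a minimal self-contained sketch of the first (working) variant. The kernel name, the block size, the value each thread stores, and the global buffer g_data are placeholders, not my actual code:

#define N_THREADS 256   // placeholder block size

__global__ void xor_kernel(unsigned int *g_data)
{
    // Declaring the buffer as unsigned int guarantees 4-byte alignment;
    // the unsigned short pointer is just another view of the same storage.
    __shared__ unsigned int alignbuffer_int[N_THREADS / 2];
    unsigned short *alignbuffer = (unsigned short *)alignbuffer_int;

    unsigned int tid = threadIdx.x;

    // Each thread stores its locally produced 16-bit value (placeholder here).
    alignbuffer[tid] = (unsigned short)(tid & 0xffffu);
    __syncthreads();

    // The first half of the threads XOR 32-bit words against global memory,
    // so the global access coalesces on 1.0/1.1 hardware.
    if (tid < N_THREADS / 2)
        g_data[blockIdx.x * (N_THREADS / 2) + tid] ^= alignbuffer_int[tid];
}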

I assume that when I declare “unsigned short alignbuffer[N_THREADS];”, each short is put in a different bank of the shared memory, thus possibly improving performance but breaking the expected semantics. Or maybe it’s something completely different that I fail to understand.

I think it’s a bug to break semantics that way, and I haven’t found anything in the documentation about that curious behavior.

Any help (including a pointer to the page of the documentation I’ve missed :-) is welcome.

I do not know if that is the reason, but even on a completely normal (non-x86, e.g. ARM) CPU there is no reason for this code to work. An unsigned short buffer only needs to be aligned to sizeof(short) = 2 bytes, whereas an access to an int may need to be aligned to sizeof(int) = 4 bytes.

On x86, violating that only costs you performance, though.

You can test whether that is the reason by adding a proper alignment specifier (check the programming guide for the style nvcc uses), but IMO it’s better to avoid the situation if you can.
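Something like this, for instance (an untested sketch; I’m not certain nvcc accepts __align__ directly on the array declaration, so check the guide):

// Force 4-byte alignment on the short array so the unsigned int* cast is valid.
// If __align__ is not accepted here, declaring the array as unsigned int and
// casting to unsigned short* (the first variant above) has the same effect.
__shared__ __align__(4) unsigned short alignbuffer[N_THREADS];
unsigned int *alignbuffer_int = (unsigned int *)alignbuffer;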

Yes! Something like this was reported earlier and it was attributed to alignment, as far as I can remember.

Indeed, no reason, unless (like me) you’re so used to casting inside vector code, where everything has the same size and alignment anyway. Fixing the alignment solves the problem.

Thanks for helping my poor brain.

I just ran into the same issue.

By the way, the documentation (CUDA 2.0) gives this example in section 4.2.2.3:

extern __shared__ char array[];

__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int*   array2 =   (int*)&array1[64];
}

which looks broken, or at least misleading.
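To make the misleading part concrete: casting like that is only safe when the byte offsets happen to be multiples of 4. In the guide’s example they are (128 shorts = 256 bytes, then 64 floats = 256 bytes), so it works as long as the extern shared base pointer is itself aligned, but a small, hypothetical variation breaks it:

extern __shared__ char array[];

__device__ void broken_func()
{
    short* array0 = (short*)array;          // offset 0: fine for a short
    // 129 shorts = 258 bytes; 258 is not a multiple of 4, so this float
    // pointer is misaligned and accesses through it are undefined.
    float* array1 = (float*)&array0[129];
    array1[0] = 0.0f;                       // misaligned store
}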

I think this should be fixed/clarified…

I believe the ‘extern’ there makes a difference. But yes, it’s misleading. I wonder, does __align__ work on array declarations, or just on type declarations?