I’m seeing a curious behavior in one of my CUDA kernels. It seems that the behavior of shared memory depends on how it is declared for small (< 4-byte) data types.
I have code that works on unsigned shorts (16 bits). To allow coalescing on 1.0/1.1 hardware, the final stage (a XOR of a global memory element with a locally produced value) is performed by half the threads on unsigned ints (32 bits). It’s obviously a lot faster. BUT, of the two supposedly equivalent ways of declaring the shared memory, one works and one doesn’t.
- If I do this:
__shared__ unsigned int alignbuffer_int[N_THREADS/2];
unsigned short *alignbuffer = (unsigned short*)alignbuffer_int;
Then each thread stores its unsigned short element in alignbuffer[tid], and the first half of the threads XORs the unsigned int element alignbuffer_int[tid] with the corresponding global memory element: this works fine.
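To make the working variant concrete, here is a minimal sketch (not my actual kernel: `xor_kernel`, `g_data`, `g_in`, and the block size of 256 are made-up stand-ins, and the 16-bit value is simply read from global memory instead of being produced locally):

```cuda
#define N_THREADS 256  // hypothetical block size

__global__ void xor_kernel(unsigned int *g_data, const unsigned short *g_in)
{
    // Shared memory declared as 32-bit words, then viewed as 16-bit elements.
    __shared__ unsigned int alignbuffer_int[N_THREADS / 2];
    unsigned short *alignbuffer = (unsigned short *)alignbuffer_int;

    const unsigned int tid = threadIdx.x;

    // Every thread stores its 16-bit value (here just read from global
    // memory; in the real kernel it is produced locally).
    alignbuffer[tid] = g_in[blockIdx.x * N_THREADS + tid];
    __syncthreads();

    // The first half of the threads then XOR global memory 32 bits at a
    // time, so the accesses coalesce on compute 1.0/1.1 hardware.
    if (tid < N_THREADS / 2)
        g_data[blockIdx.x * (N_THREADS / 2) + tid] ^= alignbuffer_int[tid];
}
```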
- But if I do this, which I believe should produce the exact same results:
__shared__ unsigned short alignbuffer[N_THREADS];
unsigned int *alignbuffer_int = (unsigned int*)alignbuffer;
Then the results are all wrong. This was observed with both CUDA 1.1 on 1.0 HW, and CUDA 2.0 on 1.3 HW.
My guess is that when I declare “unsigned short alignbuffer[N_THREADS];”, each short is placed in a different bank of shared memory, possibly improving performance but breaking the expected semantics. Or maybe it’s something completely different that I fail to understand.
I think it’s a bug to break semantics that way, and I haven’t found anything in the documentation about that curious behavior.
Any help (including a pointer to the documentation page I’ve missed :-) is welcome.