Bug(?) with short datatype in shared memory

Romain_DOLBEAU · August 22, 2008, 7:23am

Hi all,

I’m seeing a curious behavior in one of my CUDA kernel. It seems that the behavior of the shared memory is dependant on the way it is declared for small (<4 bytes) datatype.

I have a code that works on unsigned short (16 bits). To allow coalescing on 1.0/1.1 hardware, the final stage (a XOR of a global memory element with a locally produced value) is made by half the threads on unsigned int (32 bits). It’s obviously a lot faster. BUT, of the two equivalent way of declaring the shared memory, one work, and one doesn’t.

If I do this to:

__shared__ unsigned int alignbuffer_int[N_THREADS/2];

unsigned short *alignbuffer = (unsigned short*)alignbuffer_int;

Then each thread store its unsigned short element in alignbuffer[tid], and the first half of the threads XOR the unsigned int element alignbuffer_int[tid] with the corresponding global memory element: works fine.

But if I do this, which I believe should produce the exact same results:

__shared__ unsigned short alignbuffer[N_THREADS];

unsigned int *alignbuffer_int = (unsigned int*)alignbuffer;

Then the results are all wrong. This was observed with both CUDA 1.1 on 1.0 HW, and CUDA 2.0 on 1.3 HW.

I assume that when I declare “unsigned short alignbuffer[N_THREADS];”, each short is put in a different bank of the shared memory, thus possibly improving performance but breaking the expected semantics. Or maybe it’s something completely different that I fail to understand.

I think it’s a bug to break semantics that way, and I haven’t found anything in the documentation about that curious behavior.

Any help (including which page of the documentation I’ve missed :-) welcome.

Reimar · August 22, 2008, 9:05am

I do not know if that is the reason, but even on a completely normal (non-x86, maybe ARM) CPU there is no reason for this code to work. An unsigned short buffer only needs to be aligned by sizeof(short) = 2 bytes, whereas an access to an int may need to be aligned by sizeof(int) = 4 bytes.

On x86 violating that only costs you performance though.

You can test if that is the reason by adding a proper alignment statements (check the programming guide for which style nvcc uses for that), but IMO better avoid it if you can.

Sarnath · August 22, 2008, 11:11am

Yes! Something like this was reported earlier and it was attributed to alignment – as far as I could remember

Romain_DOLBEAU · September 5, 2008, 6:52am

Indeed, no reason, unless (like me) you’re so used to cast inside vector code where everything is the same size and alignment anyway. Fixing the alignment solves the problem.

Thanks for helping my poor brain.

Sylvain_Collange · September 8, 2008, 2:14pm

I just ran into the same issue.

By the way, the documentation (CUDA 2.0) gives this example in section 4.2.2.3 :

extern __shared__ char array[];

__device__ void func()      // __device__ or __global__ function

{

    short* array0 = (short*)array;

    float* array1 = (float*)&array0[128];

    int*   array2 =   (int*)&array1[64];

}

which looks broken, or at least misleading.

I think this should be fixed/clarified…

alex_dubinsky · September 8, 2008, 4:27pm

I just ran into the same issue.

By the way, the documentation (CUDA 2.0) gives this example in section 4.2.2.3 :
extern __shared__ char array[];

__device__ void func() Â  Â  Â // __device__ or __global__ function

{

 Â  Â short* array0 = (short*)array;

 Â  Â float* array1 = (float*)&array0[128];

 Â  Â int* Â  array2 = Â  (int*)&array1[64];

}
which looks broken, or at least misleading.

I think this should be fixed/clarified…

[snapback]437059[/snapback]

I believe the ‘extern’ there makes a difference. But yes, misleading. I wonder, does align work on array declarations, or just type declarations?

Topic		Replies	Views
Weird behaviour with static shared memory with short ints CUDA Programming and Performance	6	870	December 12, 2019
Arrays in shared mem How to handle non-4byte elements ? CUDA Programming and Performance	4	2301	July 3, 2008
shared memory alignment CUDA Programming and Performance	1	4761	May 17, 2010
Wrong shared memory allocation CUDA Programming and Performance	1	1164	May 20, 2009
Question about extern __shared__ CUDA Programming and Performance	4	2131	March 10, 2013
Shared Memory (Unaligned Memory access) CUDA Programming and Performance	6	1948	December 22, 2009
More question about shared mem Some point is unclear in the document CUDA Programming and Performance	4	3581	November 29, 2007
Memory Access CUDA Programming and Performance	2	1162	June 22, 2009
misalignment of shared memory? CUDA Programming and Performance	3	4222	January 25, 2013
Possible Alignment Issue CUDA Programming and Performance	5	1238	October 28, 2010

Bug(?) with short datatype in shared memory

Related topics