Question about extern __shared__

Hi there, I am sure some of you can help me answer my question.

In the CUDA programming guide they say:

“Note that pointers need to be aligned to the type they point to, so the following code, for example, does not work since array1 is not aligned to 4 bytes”

extern __shared__ float array[];
short* array0 = (short*)array;
float* array1 = (float*)&array0[127];

But is this correct?

extern __shared__ char array[];
short* array0 = (short*)array;
float* array1 = (float*)&array0[127];

I am not sure.

Another question: why does Table 3, "Alignment Requirements in Device Code", not include the bool type? Is it the same as char?

Thank you very much !!

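I don’t think it matters how you define array for dynamic shared memory. Since you are going to cast it to something else anyway, you can think of it as raw memory. array1 would still have to point to an address that is a multiple of 4. In both of your versions, &array0[127] lies 127 * 2 = 254 bytes past the start of the buffer, which is not a multiple of 4, so neither one works.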
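To make that arithmetic concrete, here is a small host-side C++ sketch of the offset check. It is not from the programming guide: the helper names byte_offset_of_element and is_aligned_for are made up for illustration, and it assumes sizeof(short) == 2, alignof(float) == 4, and a buffer whose start is itself aligned for float.

```cpp
#include <cstddef>

// Hypothetical helper: byte offset of element `index` in an array of T
// that starts at offset 0 of the shared buffer.
template <typename T>
constexpr std::size_t byte_offset_of_element(std::size_t index) {
    return index * sizeof(T);
}

// Hypothetical helper: true if `offset` is a multiple of T's alignment.
template <typename T>
constexpr bool is_aligned_for(std::size_t offset) {
    return offset % alignof(T) == 0;
}

// &array0[127] with short elements: 127 * sizeof(short) bytes into the buffer.
constexpr std::size_t bad_offset = byte_offset_of_element<short>(127);

// &array0[128] is the next short boundary that is also a float boundary.
constexpr std::size_t good_offset = byte_offset_of_element<short>(128);
```

Under those assumptions, is_aligned_for<float>(bad_offset) is false and is_aligned_for<float>(good_offset) is true, whatever element type array was declared with.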

However, if you are always going to be using shared memory for floats, and you define array as float, then you have the convenience of not needing to cast.

GPUs require natural alignment of data on the device, i.e. 2-byte quantities need to be aligned on 2-byte boundaries, 4-byte quantities on 4-byte boundaries, and so on. If objects of different unit size are used inside an extern __shared__ block, you would want to declare it as being of the largest type you plan to store in the shared memory. Then declare the objects in order of decreasing unit size and they will automatically fulfill the alignment requirement. E.g.

#define K 7
#define M 3
#define N 5

extern __shared__ double shmem[];

double *myDoubles = (double*)shmem;          // array of K doubles
int **myIntPtrs = (int **)(myDoubles + K);   // array of M pointers to int
float *myFloats = (float *)(myIntPtrs + M);  // array of N floats
short *myShorts = (short *)(myFloats + N);
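If the arrays cannot be ordered by decreasing size, one common alternative is to round each starting offset up to the alignment of the type stored there. The following is only a host-side C++ sketch of that offset arithmetic, not code from this thread: align_up, Layout, and make_layout are made-up names, and putting the shorts first is a deliberately awkward order chosen to show the padding.

```cpp
#include <cstddef>

// Round `offset` up to the next multiple of `alignment`.
// Assumes `alignment` is a power of two.
constexpr std::size_t align_up(std::size_t offset, std::size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Hypothetical description of a two-array layout inside one buffer.
struct Layout {
    std::size_t shorts_offset;   // where the short array starts
    std::size_t doubles_offset;  // where the double array starts
    std::size_t total_bytes;     // size of the whole buffer
};

// Place `num_shorts` shorts first, then `num_doubles` doubles,
// padding after the shorts so the doubles start on an aligned offset.
inline Layout make_layout(std::size_t num_shorts, std::size_t num_doubles) {
    const std::size_t doubles_offset =
        align_up(num_shorts * sizeof(short), alignof(double));
    return Layout{0, doubles_offset,
                  doubles_offset + num_doubles * sizeof(double)};
}
```

In a kernel, the double pointer would then be formed from a char view of the buffer plus doubles_offset. Ordering by decreasing size, as above, avoids the padding altogether.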

Thank you both very much. That is useful for my purpose.

I have doubles, floats, ints and booleans. So I am going to do as njuffa proposes:

extern __shared__ double shared[];

double *myDoubles = (double*)shared; // array of K doubles
double *aDouble = (double *)(myDoubles + K); // a single double
float *myFloats = (float *)(aDouble + 1); // array of M floats
float *aFloat = (float*)(myFloats + M); // a single float
int *myInts = (int *)(aFloat + 1); // array of N ints
bool *myBools = (bool *)(myInts + N);
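Since the buffer is dynamic shared memory, its size in bytes has to be supplied at kernel launch. Here is a rough host-side C++ sketch of how that size could be computed for this layout. The function name shared_bytes and the num_bools parameter are my own assumptions (the number of bools is not fixed above), and no extra padding is needed because the types are already in order of decreasing size.

```cpp
#include <cstddef>

// Byte count for the layout above: K + 1 doubles, M + 1 floats,
// N ints, then `num_bools` bools (num_bools is a hypothetical parameter).
constexpr std::size_t shared_bytes(std::size_t K, std::size_t M,
                                   std::size_t N, std::size_t num_bools) {
    return (K + 1) * sizeof(double)
         + (M + 1) * sizeof(float)
         + N * sizeof(int)
         + num_bools * sizeof(bool);
}
```

The result would go in the third launch configuration argument, for example myKernel<<<grid, block, shared_bytes(K, M, N, numBools)>>>(...), where myKernel, grid, block, and numBools are placeholder names.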

Just one question, njuffa: why do you put int **myIntPtrs before the floats?

Thanks !!!

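Pointers are either 8 bytes or 4 bytes, depending on whether the host platform is 64-bit or 32-bit. For portability of code and data between host and device, CUDA uses the pointer size native to the host system for device code as well. Since a pointer can therefore be 8 bytes wide, the pointer array goes before the 4-byte floats, which keeps the layout aligned on either platform.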

As for the size of bool: as far as I know, it is implementation-defined per the ISO C++ standard. So it would be best to check this by examining the result of sizeof(bool).
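A minimal sketch of such a check, written as plain C++ that should also compile inside a .cu file. The constant names are made up for illustration, and the static_assert encodes the assumption that bool can be laid out like char, which is exactly what is being checked here rather than something the standard guarantees.

```cpp
#include <cstddef>

// Size and alignment of bool and of a pointer, as seen by this compiler.
constexpr std::size_t bool_size    = sizeof(bool);
constexpr std::size_t bool_align   = alignof(bool);
constexpr std::size_t pointer_size = sizeof(int *);

// Compilation fails here if bool is not the same size as char,
// so a layout that treats bool like char is caught early.
static_assert(sizeof(bool) == sizeof(char),
              "bool is not one byte with this compiler");
```

If the static_assert ever fires, the bool array can still go last in the buffer, but its byte count has to use sizeof(bool) rather than an assumed 1.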