Hi there, I am sure some of you can help me answer my question.
In the CUDA programming guide they say:
“Note that pointers need to be aligned to the type they point to, so the following code, for example, does not work since array1 is not aligned to 4 bytes”
I don’t think it matters how you define the array for dynamic shared memory. Since you are going to cast it to something else anyway, you can think of it as raw memory. array1 would still have to point to an address that is a multiple of 4.
However, if you are always going to use shared memory for floats, and you declare the array as float, then you have the convenience of not needing to cast.
GPUs require natural alignment of data on the device, i.e. 2-byte quantities need to be aligned on 2-byte boundaries, 4-byte quantities on 4-byte boundaries, and so on. If objects of different unit size are used inside an extern shared block, you would want to declare it as being of the largest type you plan to store in shared memory. Then declare the objects in order of decreasing unit size and they will automatically fulfill the alignment requirement. E.g.
#define K 7
#define M 3
#define N 5
extern __shared__ double shmem[];
double *myDoubles = (double*)shmem; // array of K doubles
int **myIntPtrs = (int **)(myDoubles + K); // array of M pointers to int
float *myFloats = (float *)(myIntPtrs + M); // array of N floats
short *myShorts = (short *)(myFloats + N); // array of shorts
Pointers are either 8 bytes or 4 bytes, depending on whether the host platform is 64-bit or 32-bit. For portability of code and data between host and device, CUDA uses the pointer size native to the host system for device code as well.
As for the size of bool: best I know, it is implementation-defined per the ISO C++ standard (the only guarantee is that it is at least 1 byte). So it would be best to check this by examining the result of sizeof(bool).