Shared memory layout of multiple static shared memory declarations

Hi Experts,

I am optimizing my kernel function using padding to reduce the bank conflicts.

I’d like to know how multiple static shared memory declarations in one kernel are laid out.

If there’s only one shared memory array in a kernel:

__global__ void exampleA(...) {
    __shared__ cuFloatComplex shmem[6][8];
    ....
}

According to the official programming documentation, I think its elements will be assigned cyclically to the 32 banks.

If there are multiple static shared memory arrays, like:

__global__ void exampleB(...) {
    __shared__ cuFloatComplex shmemA[6][8];
    __shared__ cuFloatComplex shmemB[7 * 6][8];
    ....
}

I suspect that the two shared memory arrays will be assigned cyclically to the 32 banks, one after the other in declaration order.

Is that correct? Thanks!

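There is only one shared memory space per block, and both arrays live in it.
Normally the relative placement does not matter, because a warp accesses either shmemA or shmemB in a given instruction, not both.
As far as I know, the arrays are simply placed one after the other, each with the natural alignment of its type (e.g. 8 bytes for cuFloatComplex).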
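
To make the "one after the other" layout concrete, here is a small sketch of the bank arithmetic. It assumes 32 banks of 4 bytes each and, purely for illustration, that shmemA sits at byte offset 0 with shmemB directly behind it; the compiler does not guarantee that order. bankOf, kNumBanks and kBankBytes are names made up for this example.

```cuda
#include <cstdio>
#include <cuComplex.h>

// Assumption: 32 banks, each one 32-bit word (4 bytes) wide.
constexpr unsigned kNumBanks  = 32;
constexpr unsigned kBankBytes = 4;

// Bank that holds the 32-bit word at a given byte offset into shared memory.
__host__ __device__ constexpr unsigned bankOf(unsigned byteOffset) {
    return (byteOffset / kBankBytes) % kNumBanks;
}

int main() {
    // Hypothetical placement: shmemA at offset 0, shmemB directly behind it.
    const unsigned offA = 0;
    const unsigned offB =
        offA + 6 * 8 * static_cast<unsigned>(sizeof(cuFloatComplex));

    std::printf("shmemA starts at byte %u, bank %u\n", offA, bankOf(offA));
    std::printf("shmemB starts at byte %u, bank %u\n", offB, bankOf(offB));
    return 0;
}
```

Under these assumptions shmemA occupies 384 bytes (96 words, exactly three passes over the 32 banks), so shmemB would begin at bank 0 again.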

If you use a single pointer that can point to an element of either array (shmemA or shmemB), the placement becomes relevant.

You can create a struct that contains both arrays and define your padding inside it.
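
A minimal sketch of the struct idea, assuming one extra padding column per row is what you want (the right amount depends on your access pattern). SharedLayout, kPad and paddedExample are hypothetical names, not something from your code. The struct fixes the order of the two arrays, and the static_assert checks that b starts directly after a.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuComplex.h>

constexpr int kPad = 1;  // hypothetical: one extra column per row

// One struct fixes both the relative order of the arrays and the padding.
struct SharedLayout {
    cuFloatComplex a[6][8 + kPad];
    cuFloatComplex b[7 * 6][8 + kPad];
};

static_assert(offsetof(SharedLayout, b) ==
                  6 * (8 + kPad) * sizeof(cuFloatComplex),
              "b is expected to start directly after a");

__global__ void paddedExample(float* out) {
    __shared__ SharedLayout s;
    const int t = threadIdx.x;
    // Index with [row][col] as before; the padding column is never touched.
    if (t < 6 * 8) s.a[t / 8][t % 8] = make_cuFloatComplex(float(t), 0.0f);
    s.b[t / 8][t % 8] = make_cuFloatComplex(float(t), 0.0f);
    __syncthreads();
    if (t == 0) *out = cuCrealf(s.a[5][7]) + cuCrealf(s.b[7][7]);
}

int main() {
    float* dOut = nullptr;
    float hOut = 0.0f;
    cudaMalloc(&dOut, sizeof(float));
    paddedExample<<<1, 64>>>(dOut);  // 64 threads: fills a fully, b rows 0..7
    cudaMemcpy(&hOut, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    std::printf("sizeof(SharedLayout) = %zu bytes, check value = %.0f\n",
                sizeof(SharedLayout), hOut);
    return 0;
}
```

With the padding column, consecutive rows of a are 72 bytes (18 words) apart instead of 64 bytes (16 words), so column accesses no longer hit only two banks.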

Alternatively, you can treat each complex number as two floats.

In your case, you might want to access 16 complex numbers at the same time, which means 16 even and 16 odd banks, since each 8-byte complex number covers two adjacent banks.

You can view the shared memory space in a debugger or print out the pointer addresses.
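
For the printing route, something like the following should work. It is only a sketch: it uses the arrays from your exampleB, converts each pointer to an offset inside the shared memory window with __cvta_generic_to_shared (available in recent CUDA toolkits), and derives the bank under the assumption of 32 banks of 4 bytes. printLayout and launchPrintLayout are made-up names. The offsets you see depend on the compiler and the architecture, so no expected output is given.

```cuda
#include <cstdio>
#include <cuComplex.h>

__global__ void printLayout() {
    __shared__ cuFloatComplex shmemA[6][8];
    __shared__ cuFloatComplex shmemB[7 * 6][8];

    if (threadIdx.x == 0) {
        // Touch both arrays so that neither is dropped by the compiler.
        shmemA[0][0] = make_cuFloatComplex(1.0f, 0.0f);
        shmemB[0][0] = make_cuFloatComplex(2.0f, 0.0f);

        // Byte offsets of the first elements within the shared memory window.
        const unsigned offA =
            static_cast<unsigned>(__cvta_generic_to_shared(&shmemA[0][0]));
        const unsigned offB =
            static_cast<unsigned>(__cvta_generic_to_shared(&shmemB[0][0]));

        // Assumption: 32 banks, 4 bytes per bank.
        printf("shmemA: offset %u, bank %u\n", offA, (offA / 4) % 32);
        printf("shmemB: offset %u, bank %u\n", offB, (offB / 4) % 32);
    }
}

// Launches one warp and reports whether the kernel ran without error.
bool launchPrintLayout() {
    printLayout<<<1, 32>>>();
    return cudaDeviceSynchronize() == cudaSuccess;
}

int main() {
    return launchPrintLayout() ? 0 : 1;
}
```

If shmemB shows up before shmemA, or with a gap in between, that is the compiler's choice; the struct approach is the way to pin the order down.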