how does static shared memory get laid out across the banks?

I understand banks, bank width, and bank conflicts. But I don’t understand how multiple statically allocated shared memory arrays are laid out across the banks.

If I assume a bank width of 4 bytes, 32 banks, and sizeof(int) == 4 bytes:

Then how would shared1 and shared2 be laid out across the banks? Would shared2 be placed immediately after shared1 (i.e. the last value of shared1 is stored in bank 0 and the first value of shared2 is stored in bank 1)? Or would the end of shared1 be padded (i.e. the last value of shared1 is stored in bank 0, banks 1 through 31 are padded, and shared2 starts from bank 0 of the next row)?

__global__ void kernel( void ){
    __shared__ int shared1[33];
    __shared__ int shared2[31];
    ...
}

Would this behavior be different if shared2 is allocated in a function called from the kernel?

__device__ void fun( void );

__global__ void kernel( void ){
    __shared__ int shared1[33];
    fun();
    ...
}

__device__ void fun( void ){
    __shared__ int shared2[31];
}

Finally, does the order of allocation matter?

Thanks!

As far as I know, CUDA treats pointers into the shared memory space as 32-bit. Declare pointers to the shared memory arrays:

int *p1 = &shared1[0];
int *p2 = &shared2[0];

The PTX ISA converts generic, unified 64-bit pointers into pointers to specific memory spaces with the PTX instruction cvta.to.xxx.u64 (where xxx is one of const, local, shared).

Try printing the numeric pointer values and look at the hexadecimal addresses for various use cases. Of course, relying on the observed behavior can be problematic: the nvcc compiler is subject to change, and this behavior is probably not guaranteed or documented anywhere.

Alignment of elements is probably occurring at the granularity of the word size used for each array. CUDA is most likely not going to try to align arrays to any particular memory bank. Custom alignment is explained here: https://stackoverflow.com/questions/12778949/cuda-memory-alignment

I am not sure this is guaranteed. Converting 64-bit generic pointers into smaller memory-space-specific pointers is a performance optimization, and as such I would not expect it to always happen, in particular not in debug builds, where all optimizations are turned off.

Valid point, njuffa. But if it’s not guaranteed, then we have to guarantee it ourselves.

#include <cuda.h>
#include <stdio.h>

// Convert a generic 64-bit pointer into a 32-bit offset within the
// shared memory window, via the PTX cvta.to.shared instruction.
__device__ __forceinline__ unsigned int shared_ptr_32bit (void *generic)
{
  unsigned long long sharedptr;
  asm(" cvta.to.shared.u64 %0,%1;\n\t"
      : "=l"(sharedptr) : "l" (generic));
  return (unsigned int)sharedptr;
}

__global__ void test()
{
    __shared__ int a[5];
    __shared__ int b[7];
    __shared__ int c[9];
    extern __shared__ char d[];

    printf("&a[0] = $%08x\n", shared_ptr_32bit(&a[0]));
    printf("&b[0] = $%08x\n", shared_ptr_32bit(&b[0]));
    printf("&c[0] = $%08x\n", shared_ptr_32bit(&c[0]));
    printf("&d[0] = $%08x\n", shared_ptr_32bit(&d[0]));
}

int main(int argc, char **argv)
{
    test<<<1,1,1024>>>();
    cudaDeviceSynchronize();
}

Thanks, cbuchner1!

I don’t know PTX machine code, but I assume this prints the byte address within shared memory in decimal?