static shared vs. extern shared?

What are the unexpected differences between these styles of chopping up shared memory?

 // Block 1: kernel launched with 128*sizeof(unsigned int) bytes of dynamic shared memory

  extern __shared__ unsigned int shared[];

  unsigned int *shared_data = &shared[0];

  unsigned int *shared_rank = &shared[64];  // Offset could be passed in too.


  // Block 2

  __shared__ unsigned int shared[128];

  unsigned int *shared_data = &shared[0];

  unsigned int *shared_rank = &shared[64];

 // Block 3

  __shared__ unsigned int shared_data[64];

  __shared__ unsigned int shared_rank[64];


 sort_rank( shared_rank, shared_data, thread );


I’ve boiled my program down to the pseudocode above.

I’d like to avoid static shared memory allocation and be able to size things at run time (like “Block 1”); however, every variant except “Block 3” results in some kind of error that hangs the device. sort_rank runs a fixed-length loop based on blockDim. I’ve tried adding bounds checking on array accesses (i.e. array[min(63, (unsigned)i)]), without luck. All three variants work in the emulator.

Thanks for any advice.


Edit: Looks like this was partially addressed here (I should have scrolled down…), although the solution in the second block of code in that post doesn’t compile; the compiler complains that shared pointers can’t have initializers.

The following works on both the emulator and the GPU. There are probably other ways to do it.

The “const” attribute on the pointers is optional; it tells the compiler that the pointers vui and vf themselves shouldn’t be changed (the data they point to can still be written).

Here, the size of the vui array is equal to the number of threads. You could pass the size as a parameter, too.

extern __shared__ unsigned char shmem[];

__global__ void testKernel(...)
{
  const unsigned int tid = threadIdx.x;
  const unsigned int num_threads = blockDim.x;

  // First region: num_threads unsigned ints at the start of the buffer.
  unsigned int * const vui = (unsigned int *) shmem;
  // Second region: floats, starting right after the unsigned int region.
  float * const vf = (float *) (shmem + num_threads * sizeof(unsigned int));

  vui[tid] = 0;
  vf[tid] = 0.0f;
}
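For anyone who wants to try this end to end, here is a minimal sketch of the same partitioning with the region size passed in as a kernel parameter and the total byte count supplied as the third launch-configuration argument. The names splitKernel, runSplitDemo, out_ui, out_f and n_ui are made up for illustration; they are not from the original program.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One untyped dynamic shared buffer; its size is fixed at launch time.
extern __shared__ unsigned char shmem[];

// Hypothetical kernel: n_ui is the element count of the unsigned int region,
// passed from the host instead of being derived from blockDim.x.
__global__ void splitKernel(unsigned int *out_ui, float *out_f, unsigned int n_ui)
{
    const unsigned int tid = threadIdx.x;

    unsigned int * const vui = reinterpret_cast<unsigned int *>(shmem);
    // Both element types are 4 bytes wide, so the float region stays aligned.
    float * const vf =
        reinterpret_cast<float *>(shmem + n_ui * sizeof(unsigned int));

    if (tid < n_ui) {
        vui[tid] = tid;
        vf[tid] = 0.5f * tid;
    }
    __syncthreads();

    // Read a neighbour's slot so the values really go through shared memory.
    if (tid < n_ui) {
        const unsigned int src = (tid + 1) % n_ui;
        out_ui[tid] = vui[src];
        out_f[tid] = vf[src];
    }
}

// Launches one block of n threads. Returns the number of wrong results,
// or -1 if any CUDA call fails.
int runSplitDemo(unsigned int n)
{
    unsigned int *d_ui = nullptr;
    float *d_f = nullptr;
    if (cudaMalloc(&d_ui, n * sizeof(unsigned int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_f, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_ui);
        return -1;
    }

    // Third launch argument: bytes of dynamic shared memory per block.
    const size_t bytes = n * (sizeof(unsigned int) + sizeof(float));
    splitKernel<<<1, n, bytes>>>(d_ui, d_f, n);
    cudaError_t err = cudaGetLastError();

    std::vector<unsigned int> h_ui(n);
    std::vector<float> h_f(n);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_ui.data(), d_ui, n * sizeof(unsigned int),
                         cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_f.data(), d_f, n * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_ui);
    cudaFree(d_f);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (unsigned int i = 0; i < n; ++i) {
        const unsigned int src = (i + 1) % n;
        if (h_ui[i] != src || h_f[i] != 0.5f * src) ++bad;
    }
    return bad;
}
```

Calling runSplitDemo(64) from main should return 0 if the two regions do not overlap. If the second region held a wider type (double, for example), the byte offset would need rounding up to that type's alignment.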



I believe the problem was caused by passing these pointers to a function by reference. While the compiler does handle references in certain cases (see examples in the SDK), perhaps the __shared__ qualifier is lost or discarded when pointers are passed this way, leading to undefined behavior (which in this case meant a device hang). The function in question worked just fine when I passed it pointers to global memory arrays or statically sized shared memory arrays.
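As a rough illustration of the by-value alternative, the sketch below carves the “Block 1” layout out of a dynamic shared array and hands plain pointers (no references) to a device function. The real sort_rank was never posted, so the body here is a hypothetical stand-in (a simple rank sort), and the extra parameter n, rankKernel and runRankDemo are likewise assumptions made for the example.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

extern __shared__ unsigned int shared[];

// Hypothetical stand-in for sort_rank. The pointers are passed by value.
// rank[tid] = number of elements that must come before data[tid]
// (ties are broken by index so every rank is unique).
__device__ void sort_rank(unsigned int *rank, const unsigned int *data,
                          unsigned int tid, unsigned int n)
{
    unsigned int r = 0;
    for (unsigned int i = 0; i < n; ++i) {
        if (data[i] < data[tid] || (data[i] == data[tid] && i < tid)) ++r;
    }
    rank[tid] = r;
}

__global__ void rankKernel(const unsigned int *in, unsigned int *out)
{
    const unsigned int tid = threadIdx.x;
    const unsigned int n = blockDim.x;

    // "Block 1" layout: data in the first n slots, ranks in the next n.
    unsigned int *shared_data = &shared[0];
    unsigned int *shared_rank = &shared[n];

    shared_data[tid] = in[tid];
    __syncthreads();  // every thread reads all of shared_data below

    sort_rank(shared_rank, shared_data, tid, n);

    // Each thread reads only its own rank, so no further barrier is needed.
    out[shared_rank[tid]] = shared_data[tid];
}

// Sorts n values with one block of n threads. Returns the number of
// positions that differ from std::sort, or -1 if any CUDA call fails.
int runRankDemo(unsigned int n)
{
    std::vector<unsigned int> h_in(n), h_out(n);
    for (unsigned int i = 0; i < n; ++i) h_in[i] = (i * 37u) % 101u;

    unsigned int *d_in = nullptr, *d_out = nullptr;
    const size_t bytes = n * sizeof(unsigned int);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaError_t err = cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        // Dynamic shared memory: n data slots plus n rank slots.
        rankKernel<<<1, n, 2 * bytes>>>(d_in, d_out);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    std::sort(h_in.begin(), h_in.end());
    int bad = 0;
    for (unsigned int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i]) ++bad;
    return bad;
}
```

This does not prove that references were the cause of the hang; it only shows a pattern where nothing but ordinary pointer values crosses the function boundary, which is the form that worked in the cases described above.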

Count this as one of the dangers of straying too far from C.