static __shared__ vs. extern __shared__?

abestephens · March 24, 2007, 2:06am

What are the unexpected differences between these styles of chopping up shared memory?

  // Block 1: kernel launched with 128*sizeof(unsigned int) bytes of dynamic shared memory
  extern __shared__ unsigned int shared[];
  unsigned int *shared_data = &shared[0];
  unsigned int *shared_rank = &shared[64];  // Offset could be passed in too.

  // Block 2: one static array, split with pointers
  __shared__ unsigned int shared[128];
  unsigned int *shared_data = &shared[0];
  unsigned int *shared_rank = &shared[64];

  // Block 3: two separate static arrays
  __shared__ unsigned int shared_data[64];
  __shared__ unsigned int shared_rank[64];

  ....
  sort_rank( shared_rank, shared_data, thread );
  ....

I've watered down my program to the above pseudocode.

I'd like to avoid static shared memory allocation and be able to size things at run time (like "Block 1"); however, all but "Block 3" result in some kind of error that hangs the device. sort_rank runs a fixed-length loop based on blockDim. I've tried adding bounds checking on array accesses (e.g. array[min(63, (unsigned)i)]), without luck. All three versions work in the emulator.
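For reference, a minimal sketch of the "Block 1" style as it can be written with a current CUDA toolchain: one extern __shared__ array sized at launch through the third <<<>>> parameter, split with plain (non-__shared__) local pointers, and the offset passed in as a kernel argument, as the Block 1 comment suggests. The kernel `split_extern` and host helper `run_split` are hypothetical names; the kernel only copies input into shared_data, stores a per-thread value in shared_rank, and reads both back from another thread so the split can be checked on the host.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one dynamically sized shared buffer split into two
// logical arrays. rank_offset is the element index where shared_rank starts.
__global__ void split_extern(const unsigned int *in, unsigned int *out_data,
                             unsigned int *out_rank, unsigned int rank_offset)
{
    extern __shared__ unsigned int shared[];
    unsigned int *shared_data = &shared[0];            // plain local pointer
    unsigned int *shared_rank = &shared[rank_offset];  // plain local pointer

    const unsigned int tid = threadIdx.x;
    shared_data[tid] = in[tid];
    shared_rank[tid] = blockDim.x - 1 - tid;  // placeholder per-thread value
    __syncthreads();  // make every thread's writes visible to the block

    // Read the mirrored slot so each thread consumes another thread's write.
    const unsigned int partner = blockDim.x - 1 - tid;
    out_data[tid] = shared_data[partner];
    out_rank[tid] = shared_rank[partner];
}

// Hypothetical host helper: launches one block of n threads and requests
// 2 * n * sizeof(unsigned int) bytes of dynamic shared memory at run time.
bool run_split(unsigned int n, std::vector<unsigned int> &data,
               std::vector<unsigned int> &rank)
{
    assert(n > 0 && n <= 1024);  // one block, at most 1024 threads
    std::vector<unsigned int> in(n);
    for (unsigned int i = 0; i < n; ++i) in[i] = 100u + i;

    const size_t bytes = n * sizeof(unsigned int);
    unsigned int *d_in = nullptr, *d_data = nullptr, *d_rank = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_data, bytes);
    cudaMalloc(&d_rank, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    const size_t shared_bytes = 2 * bytes;  // data array + rank array
    split_extern<<<1, n, shared_bytes>>>(d_in, d_data, d_rank, n);

    data.resize(n);
    rank.resize(n);
    cudaMemcpy(data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(rank.data(), d_rank, bytes, cudaMemcpyDeviceToHost);
    const bool ok = (cudaGetLastError() == cudaSuccess);
    cudaFree(d_in);
    cudaFree(d_data);
    cudaFree(d_rank);
    return ok;
}
```

Because shared_rank starts at element rank_offset, the launch has to request at least (rank_offset + blockDim.x) * sizeof(unsigned int) bytes; calling run_split(64, ...) reproduces the 64/64 split of Block 1 with its 128*sizeof(unsigned int) launch size.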

Thanks for any advice.

Abe

Edit: It looks like this was partially addressed here (I should have scrolled down…), although the solution in the second code block in that post doesn't compile: the compiler complains that __shared__ pointers can't have initializers.

jhanweck · March 25, 2007, 6:23pm

The following works both in the emulator and on the GPU. There are probably other ways to do it.

The "const" qualifier on the pointers is optional; it tells the compiler that vui and vf shouldn't be changed.

Here, the size of the vui array is equal to the number of threads. You could pass the size as a parameter, too.

extern __shared__ unsigned char shmem[];

__global__ void testKernel(...)
{
  const unsigned int tid = threadIdx.x;
  const unsigned int num_threads = blockDim.x;

  // vui occupies the first num_threads unsigned ints; vf starts right after it.
  unsigned int * const vui = (unsigned int *) shmem;
  float * const vf = (float *) (shmem + num_threads * sizeof(unsigned int));
  ...
  vui[tid] = 0;
  vf[tid] = 0.0f;
  ...
}
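A complete, runnable version of this kernel, as a sketch: the elided parameter list is replaced here with two hypothetical output arrays, and the hypothetical host helper `run_testKernel` shows the part the snippet leaves implicit, namely that the third launch parameter must cover both arrays, n * (sizeof(unsigned int) + sizeof(float)) bytes. Each thread reads its mirrored neighbour's values so the shared layout is actually exercised.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One untyped dynamic shared buffer, carved into typed arrays in the kernel.
extern __shared__ unsigned char shmem[];

// Output parameters are hypothetical; the original elides the parameter list.
__global__ void testKernel(unsigned int *out_ui, float *out_f)
{
    const unsigned int tid = threadIdx.x;
    const unsigned int num_threads = blockDim.x;

    // The first num_threads * sizeof(unsigned int) bytes hold the unsigned ints.
    // The float section starts at a multiple of 4 bytes, so it is aligned for float.
    unsigned int * const vui = (unsigned int *) shmem;
    float * const vf = (float *) (shmem + num_threads * sizeof(unsigned int));

    vui[tid] = 2u * tid;
    vf[tid] = 0.5f * tid;
    __syncthreads();  // all writes done before reading another thread's slot

    const unsigned int partner = num_threads - 1 - tid;
    out_ui[tid] = vui[partner];
    out_f[tid] = vf[partner];
}

// Hypothetical host helper: one block of n threads.
bool run_testKernel(unsigned int n, std::vector<unsigned int> &ui,
                    std::vector<float> &f)
{
    assert(n > 0 && n <= 1024);
    unsigned int *d_ui = nullptr;
    float *d_f = nullptr;
    cudaMalloc(&d_ui, n * sizeof(unsigned int));
    cudaMalloc(&d_f, n * sizeof(float));

    // Dynamic shared memory must be large enough for both arrays.
    const size_t shared_bytes = n * (sizeof(unsigned int) + sizeof(float));
    testKernel<<<1, n, shared_bytes>>>(d_ui, d_f);

    ui.resize(n);
    f.resize(n);
    cudaMemcpy(ui.data(), d_ui, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaMemcpy(f.data(), d_f, n * sizeof(float), cudaMemcpyDeviceToHost);
    const bool ok = (cudaGetLastError() == cudaSuccess);
    cudaFree(d_ui);
    cudaFree(d_f);
    return ok;
}
```

If more types are added to the same buffer, placing the types with the largest alignment first keeps every section aligned without padding.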

abestephens · March 26, 2007, 1:32am

> The following works on emulator and GPU.
>   unsigned int * const vui = (unsigned int *) shmem;
>   float * const vf = (float *) (shmem + num_threads * sizeof(unsigned int));

I believe the problem was caused by passing these pointers to a function by reference. While the compiler does handle references in certain cases (see the examples in the SDK), perhaps the __shared__ qualifier is lost or discarded when pointers are passed this way, leading to undefined behavior (in this case, a device hang). The function in question worked fine when I passed it pointers to global memory arrays or to statically sized shared memory arrays.
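A minimal sketch of the pattern described above: a __device__ helper that receives the shared-memory pointers by value, as plain `unsigned int *` parameters, rather than by reference. The helper `rank_elements` and its ranking logic are hypothetical stand-ins for sort_rank, whose body isn't shown: each thread counts how many elements sort before its own, breaking ties by index, in a fixed-length loop over blockDim.x. On devices of compute capability 2.0 and later, shared and global pointers live in a generic address space, so a helper like this can take either kind.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for sort_rank; both pointers are passed by value.
// rank[tid] = number of elements that sort before data[tid] (ties by index).
__device__ void rank_elements(unsigned int *rank, const unsigned int *data,
                              unsigned int tid, unsigned int n)
{
    const unsigned int mine = data[tid];
    unsigned int r = 0;
    for (unsigned int j = 0; j < n; ++j) {  // fixed-length loop over the block
        const unsigned int other = data[j];
        if (other < mine || (other == mine && j < tid)) ++r;
    }
    rank[tid] = r;
}

__global__ void rank_kernel(const unsigned int *in, unsigned int *out)
{
    extern __shared__ unsigned int shared[];
    unsigned int *shared_data = &shared[0];
    unsigned int *shared_rank = &shared[blockDim.x];

    const unsigned int tid = threadIdx.x;
    shared_data[tid] = in[tid];
    __syncthreads();  // every element must be loaded before any thread ranks

    rank_elements(shared_rank, shared_data, tid, blockDim.x);
    out[tid] = shared_rank[tid];  // each thread reads back only its own slot
}

// Hypothetical host helper: one block, one thread per element.
bool run_rank(const std::vector<unsigned int> &in, std::vector<unsigned int> &out)
{
    const unsigned int n = (unsigned int) in.size();
    assert(n > 0 && n <= 1024);
    const size_t bytes = n * sizeof(unsigned int);
    unsigned int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    rank_kernel<<<1, n, 2 * bytes>>>(d_in, d_out);  // data array + rank array

    out.resize(n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    const bool ok = (cudaGetLastError() == cudaSuccess);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

For example, the input {5, 5, 1, 1} ranks to {2, 3, 0, 1}: the two 1s come first in index order, then the two 5s.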

Count this as one of the dangers of straying too far from C.

Abe
