Newbie question: Shared memory

According to the Programming Guide.

__global__ function parameters are currently passed via shared memory to the device and limited to 256 bytes.

I really don’t understand what it means.

Could anyone give me an example?

Thank you.


Sure. It’s just a sum of the arguments to the global kernel call.

__global__ void FindExceptions(unsigned int steps, 

          unsigned int base,

          unsigned int *dCurrentN);

That’s 3 arguments, using 3*4=12 bytes

__global__ void FindExceptions(float4 start, float4 end, int count)

That’s three arguments using 36 bytes.

Don’t go over 256. And even if you use just a few arguments, remember it uses up some of your shared memory.

It’s very helpful.
Thank you very much.


Hooking up… :)

I’m also learning CUDA now and I’m a bit confused about shared memory and the cubin file.

Question 1:

This is a kernel:

__global__ void shareKernel()


	__shared__ float array[128];

	array[0] = 0;


The copy in there is for the compiler not to optimize the kernel into an empty binary (since besides allocating the array nothing happens). Without it the cubin file was ‘all zeros’. Off topic: how to ask nvcc not to do anything with it (= compile as is)?

Anyhow, I’d expect this code to use 128sizeof(float) = 1284=512 bytes. But the cubin file looks like so:

code  {

	name = _Z11shareKernelv

	lmem = 0

	smem = 528

	reg = 1

	bar = 0

	bincode  {

  0x10008001 0x00000003 0x00000801 0xe4200781 



Why is it bigger than 512? I’m missing 16 bytes. What else is there that I don’t know about? There are no parameters so nothing to be passed to the function though I don’t think the cubin file shows those bytes anyway. Maybe it’s because of the array[0]=0 line?

Question 2:

What is the difference between passing the size of shared memory I need in the kernel config versus simply passing zero and declaring an array in the kernel itself?


Those missing 16 bytes are somehow related to my previous posting on kernel parameter shared memory usage. There seems to be a magical 16 byte minimum overhead. You can see more information if you do:

nvcc -ptx

ptxas -v mycode.ptx

This basically tells nvcc to generate PTX code instead of a CUBIN file. ptxas then converts the PTX code into a CUBIN file but the -v option will spit out register, shared memory, and constant memory usage. You’ll often find the shared memory reported as “N+M bytes smem” where N>=16, M>=16 and N>=M.

Question 2:

You pass the size of shared memory usage when you have ‘dynamically’ allocated arrays in shared memory. I have yet to get a clear answer on this but this is my understanding: if the compiler can determine how much shared memory a variable takes up, you apparently don’t have to account for that space again when you set the shared memory size at kernel launch time. However, if you have a ‘dynamic’ array:

extern __shared__ unsigned int my_array[];

(My syntax could be wrong. I don’t work in CUDA device code very often) then the compiler has no idea how big this array is. Therefore you must provide a single value that accounts for all the space your dynamic arrays need and it’s up to your kernel to work within those limits.

I hope that wasn’t too vague.


OK, I will for now just remember about this and maybe someday someone’ll explain this in detail :)


BTW: You probably know that but instead of invoking nvcc and later ptxas you can just invoke nvcc with: –ptxas-options=-v which does the same :)


This is however still a bit unclear…

The only place a shared memory can be allocated is within the kernel right? Or, and that’s the second option, in the kernels config. So if I get this right to use the first version I do:

__global__ void shareKernel()


__shared__ float array[128];


and the second:

__global__ void shareKernel()


extern __shared__ float* array;


// and invoke:

shareKernel<<< G, B, sizeof(float)*128>>>();

//and make sure I don't go over 127 or something (what?) will happen

and both do pretty much the same. Is that correct?

the second version is dynamic, as in on runtime you can change sizes (which can be useful if you also change amount of threads e.g. at runtime.

Oh, you’re right :biggrin:

Damn, what was I thinking… :argh:

Thank you for clarifying this for me!

So, it’s like this:

I can either declare static shared mem variables:

__shared__ float F;

__shared__ float arr[512];

and access them normally [but the size is constant, computed at compile time]


declare just the size [in bytes] that a single block will be using like so:

kernel<<<G,B, dynamically calculated size in bytes>>>();

and in the kernel I then require the shared memory pointer plus some initializing to get the equivalent, like so:

extern __shared__ char* s_mem; // this now points to the beginning of shared memory

float* F = (float*) s_mem; // and that's where F is located

float* arr = (float*) (s_mem+sizeof(float*)); // the 512 floats come later

right? Now both do the same except that in the latter I access F via pointer.

Assuming it’s OK a question arises:

What will happen when I also want a static shared variable? Like so:

kernel<<<G,B, dynamically calculated size in bytes>>>();


__global__ void kernel()


extern __shared__ char* s_mem; // this now points to the beginning of shared memory

float* F = (float*) s_mem; // and that's where F is located

float* arr = (float*) (s_mem+sizeof(float*)); // the 512 floats come later

// and now I might want to add:

__shared__ short tmp[128];


Will tmp begin at s_mem + dynamically calculated size in bytes? (plus maybe some alignment but that’s not the point)

I hope that it does - that would mean I get it :)