How to use all 16KB shared memory

I can confirm that this works for my application.

Is the ‘volatile’ trick required? I didn’t use it as I read your post after changing my application. In my case it seems to work without volatile, but maybe I was just lucky that the compiler produced the required code…

Thanks for your input, it solved my problems, although the approach seems to be a bit risky.

I’ll code a slower fallback for the case it doesn’t work.

Hi Jamie,

I’m now trying to do a sum reduction on a large array (~1000-5000 floats) which won’t fit into shared memory (I won’t know the size of the array until runtime).

Take a look at :…an/doc/scan.pdf

Specifically, look at “Arrays of arbitrary size”…; this is a more elegant way of doing what I wrote in previous posts.
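For the sum reduction asked about above, the same "split into blocks, then combine the per-block results" idea can be sketched as follows. This is a plain multi-pass reduction, not the scan from the linked document, and the names `BLOCK`, `partialSum` and `sumOnDevice` are made up for illustration. Each block only needs `BLOCK` floats of shared memory, so the array size is limited by global memory rather than by the 16 KB.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block; must be a power of two (illustrative choice)

// Each block adds up to 2*BLOCK input elements and writes one partial sum.
__global__ void partialSum(const float *in, float *out, int n)
{
    __shared__ float s[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * (2 * BLOCK) + tid;

    // Every thread loads up to two elements; out-of-range elements count as 0.
    float v = 0.0f;
    if (i < n) v += in[i];
    if (i + BLOCK < n) v += in[i + BLOCK];
    s[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

// Hypothetical host helper: sums n >= 1 floats by launching the kernel
// repeatedly until a single value is left. Error checking is omitted.
float sumOnDevice(const float *h_in, int n)
{
    float *d_a, *d_b;
    int blocks = (n + 2 * BLOCK - 1) / (2 * BLOCK);
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, blocks * sizeof(float));
    cudaMemcpy(d_a, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    while (n > 1) {
        blocks = (n + 2 * BLOCK - 1) / (2 * BLOCK);
        partialSum<<<blocks, BLOCK>>>(d_a, d_b, n);
        // The partial sums become the input of the next pass.
        float *tmp = d_a; d_a = d_b; d_b = tmp;
        n = blocks;
    }

    float result = 0.0f;
    cudaMemcpy(&result, d_a, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return result;
}
```

For 1000-5000 floats this needs at most two kernel launches; for arrays that small it may well be simpler to copy the handful of partial sums back and add them on the host.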


It is not required if you either don’t use your input arguments after shared memory is trashed, or are sure that the parameter has ended up in a register. I’d recommend always checking the PTX code. If the compiler decides to read input arguments too late, the volatile trick is a 100% reliable way to make it stop doing so (at least in version 2.2 of the toolkit).

Sure it is; you probably shouldn’t expect this to work on future hardware.

I’m just going to say that if you ever ship code that does this and then I have to support it when it breaks, I will personally hunt you down :)

Ha ha ha… :-)

While you are sharpening the axe: Is there a reason these values can’t go into constant memory? I expect the parameters to almost exclusively be accessed in a broadcast mode, which would make the constant cache ideal.

Where will you put “blockID” in constant memory? To access it, you would still need a blockId, so you have a chicken-and-egg problem.

The gridDim and blockDim can go in constant memory as they are constant to the whole invocation.

The constant cache is not specific to a streaming multiprocessor. Some special variables, like blockIdx, are specific to a thread block, so the constant memory data would have to be partitioned: multiple copies of the special variables would need to be placed there. Maybe the guys who wrote the thread block scheduler reasoned that it was easier to implement by using shared memory.

Ah right, never mind. Still, if there was some other place to stash the 256 bytes (or less) of parameters per block, it would make life a lot easier for people with algorithms that like large power-of-2 blocks of shared memory. Hopefully that’s on tmurray’s spreadsheet somewhere…

Arguments, blockDim and gridDim are common to all blocks/threads. They can be placed in global memory in a reserved (read-only) area for the kernel.

The blockId and threadId are available in registers anyway, I believe. So we really don’t need to store them anywhere.

If we can make this arrangement, all 16K becomes available to the kernel. It is up to the programmer to either cache the parameters in shared memory or registers, or not cache them at all. Just my 2 cents!

Further, I am personally happy with the way parameters are stored in shared memory. I have no complaints.

So, this feature could be an optional parameter specified during a kernel launch.

Global memory? Are you sure that’s a good idea?

Trying this idea on different kernels, I discovered that the ‘volatile’ trick doesn’t actually do anything. It definitely affects the PTX code, and according to the PTX the volatile keyword does force a variable into a register. But decuda shows that the final compiled binary has the code completely reshuffled, which undoes the effect of the volatile keyword and leaves references to input arguments after you’ve trashed the shared memory parameter area.
I’d recommend disregarding the PTX code and always checking the decuda output to be sure that overwriting the parameter area doesn’t make your kernel crash. If you can’t make the code stop referencing input arguments in the wrong places (which is what I am experiencing now), move the parameters to constant memory.
Such behavior probably also means that new compiler releases may break a working kernel that uses this hack.
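A minimal sketch of moving the parameters to constant memory, assuming a simple scaling kernel. `KernelParams`, `c_params`, `scaleKernel` and `launchScale` are hypothetical names, not anything from the posts above. The kernel takes no arguments, so nothing in it reads the argument area; every thread reads the same constant addresses, which is the broadcast case the constant cache handles well.

```cuda
#include <cuda_runtime.h>

// Hypothetical parameter block; the fields are illustrative only.
struct KernelParams {
    const float *in;
    float *out;
    int n;
    float scale;
};

// One copy in constant memory, shared by every block of the launch.
__constant__ KernelParams c_params;

// Note: no kernel arguments at all.
__global__ void scaleKernel()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < c_params.n)
        c_params.out[i] = c_params.scale * c_params.in[i];
}

// Host side: fill in the constant block, then launch.
// d_in and d_out are device pointers; error checking is omitted.
void launchScale(const float *d_in, float *d_out, int n, float scale)
{
    KernelParams p = { d_in, d_out, n, scale };
    cudaMemcpyToSymbol(c_params, &p, sizeof(p));

    int threads = 256;
    scaleKernel<<<(n + threads - 1) / threads, threads>>>();
}
```

This only takes the arguments out of the picture. It does not change where the hardware keeps gridDim, blockDim and blockIdx.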

I don’t know. But you can always cache them in registers (local variables) at the start of the kernel, so each thread will have a local copy of the parameters.

But most likely, the parameters only contain global pointers, integers and floats.

Also, if you have too many arguments, we usually allocate a struct in global memory and pass the pointer to the kernel. If you imagine that case, then every kernel can be made a single-argument kernel, and so on…
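The struct-in-global-memory approach can be sketched like this. `AddArgs`, `addKernel` and `launchAdd` are hypothetical names. Each thread copies the struct into a local variable first; the compiler will normally keep those values in registers, but that is not guaranteed.

```cuda
#include <cuda_runtime.h>

// Hypothetical argument block that lives in global memory.
struct AddArgs {
    const float *a;
    const float *b;
    float *c;
    int n;
};

// Single-argument kernel: only one pointer is passed at launch.
__global__ void addKernel(const AddArgs *args)
{
    // Per-thread local copy of the arguments, read once from global memory.
    const AddArgs local = *args;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < local.n)
        local.c[i] = local.a[i] + local.b[i];
}

// Host side. d_a, d_b and d_c are device pointers; error checking is omitted.
void launchAdd(const float *d_a, const float *d_b, float *d_c, int n)
{
    AddArgs h_args = { d_a, d_b, d_c, n };
    AddArgs *d_args;
    cudaMalloc((void **)&d_args, sizeof(AddArgs));
    cudaMemcpy(d_args, &h_args, sizeof(AddArgs), cudaMemcpyHostToDevice);

    addKernel<<<(n + 255) / 256, 256>>>(d_args);

    // Wait for the kernel before releasing the argument block.
    cudaDeviceSynchronize();
    cudaFree(d_args);
}
```

All threads read the struct from the same address, so how well this performs depends on how the hardware handles that access pattern.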

Remember that reading global memory in a coalesced manner means each thread must read its own distinct word (or double/quad word). If all threads must read one address, there’s constant memory with broadcasting that works for this. And it’s cached.

That’s not true as of compute capability 1.2; global memory has broadcast capabilities.

Oh, I only worked on 1.1. Good to know!

By using

__shared__ float s_Y[8][8][8][8];

I get

uses too much shared data (0x402c bytes + 0x10 bytes system, 0x4000 max)

I guess that this is the same problem?

Yes, it is the same problem. If you add or remove a kernel parameter you’ll see that the 0x2c part changes, but you’ll never get the overhead down to 0.
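The numbers in that message can be taken apart with a few lines of arithmetic. The figures come from the compiler output quoted above; reading the leftover 0x2c bytes as the kernel-argument area is an assumption based on this thread, and the constant and function names are made up for illustration.

```cuda
#include <cstddef>

// Figures taken from the compiler message quoted above.
const size_t kSharedMax   = 0x4000;  // 16 KB limit
const size_t kSystemBytes = 0x10;    // "system" bytes in the message
const size_t kReported    = 0x402c;  // shared data reported for the kernel

// s_Y on its own already fills the whole 16 KB.
const size_t kArrayBytes = 8 * 8 * 8 * 8 * sizeof(float);

// Assumption: the remainder is the kernel-argument area.
const size_t kParamBytes = kReported - kArrayBytes;

// Bytes left for your own shared arrays once the overhead is subtracted.
inline size_t usableSharedBytes(size_t paramBytes)
{
    return kSharedMax - kSystemBytes - paramBytes;
}
```

With these figures, `usableSharedBytes(kParamBytes)` is 16324 bytes, i.e. room for 4081 floats rather than the 4096 that `s_Y` asks for.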

DID YOU KNOW: this will almost certainly break on GF100.

this should work, right? ;)

if (compute_capability < 2.0)