Is the ‘volatile’ trick required? I didn’t use it as I read your post after changing my application. In my case it seems to work without volatile, but maybe I was just lucky that the compiler produced the required code…
Thanks for your input, it solved my problems, although the approach seems a bit risky.
I'll code a slower fallback for the case where it doesn't work.
It is not required if you either don't use your input arguments after shared memory is trashed, or you are sure the parameter has ended up in a register. I'd recommend always checking the PTX code. In case the compiler decides to read the input args too late, the volatile trick is a 100% reliable way to make it stop doing so (at least in version 2.2 of the toolkit).
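A minimal sketch of what the trick looks like in practice, assuming a made-up kernel (names and the scaling logic are illustrative, not from this thread). The idea is to copy the arguments into `volatile` locals at the top of the kernel, so the compiler (in theory) reads them out of the shared-memory parameter area before the block overwrites it:

```cuda
// Hypothetical example: force the kernel arguments into registers via
// volatile locals before trashing the shared-memory parameter area.
__global__ void scale(float *data, int n, float factor)
{
    // Read the arguments immediately. 'volatile' on the locals is meant
    // to stop the compiler from deferring these loads until after the
    // parameter area has been overwritten.
    float * volatile vdata   = data;
    volatile int      vn     = n;
    volatile float    vfact  = factor;

    extern __shared__ float smem[];  // may alias the parameter area on pre-Fermi parts

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = 0.0f;        // shared memory is now "trashed"
    __syncthreads();

    if (i < vn)
        vdata[i] *= vfact;
}
```

As the later posts in this thread point out, the PTX may look right while the final binary still reorders the loads, so checking the decuda output is the only real verification.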
Sure it is; you probably shouldn't expect this to work on future hardware.
While you are sharpening the axe: Is there a reason these values can’t go into constant memory? I expect the parameters to almost exclusively be accessed in a broadcast mode, which would make the constant cache ideal.
The constant cache is not specific to a shader multiprocessor (CUDA Processor). Some special variables, like blockIdx, are specific to a thread block, so the constant memory data would have to be partitioned: multiple copies of the special variables would need to be placed there. Maybe the guys who wrote the thread-block scheduler reasoned that it was easier to implement using shared memory.
Ah right, never mind. Still, if there was some other place to stash the 256 bytes (or less) of parameters per block, it would make life a lot easier for people with algorithms that like large power-of-2 blocks of shared memory. Hopefully that’s on tmurray’s spreadsheet somewhere…
Trying this idea on different kernels, I discovered that the 'volatile' trick doesn't actually do anything. It definitely affects the PTX code, and according to the PTX the volatile keyword does force a variable into a register. But decuda shows that the final compiled binary has the code completely reshuffled, undoing the effect of the volatile keyword and referencing the input arguments after you've trashed the shared-memory parameter area.
I'd recommend disregarding the PTX and always checking the decuda output to be sure that overwriting the parameter area doesn't make your kernel crash. If you can't make the code stop referencing the input arguments in the wrong places (which is what I'm experiencing now), move the parameters to constant memory.
Such behavior also probably means that a new compiler release may break a kernel that currently works with this hack.
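For reference, moving the parameters to constant memory might look like the sketch below. The struct and names (`KParams`, `g_params`) are illustrative assumptions, not from this thread:

```cuda
// Hypothetical sketch: keep kernel parameters in __constant__ memory
// instead of relying on the shared-memory argument area.
struct KParams {
    float *data;
    int    n;
    float  factor;
};

__constant__ KParams g_params;

__global__ void scale(void)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // All threads read the same addresses, so the constant cache
    // broadcasts the values; shared memory is left entirely free.
    if (i < g_params.n)
        g_params.data[i] *= g_params.factor;
}

// Host side, before launch (error checking omitted):
//   KParams p = { d_data, n, 2.0f };
//   cudaMemcpyToSymbol(g_params, &p, sizeof(p));
//   scale<<<grid, block>>>();
```

The cost is an extra `cudaMemcpyToSymbol` per distinct parameter set, but nothing in the kernel depends on where the driver stashes the launch arguments.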
I don't know. But you can always cache them in registers (local) at the start of the kernel, so each thread has a local copy of the parameters.
But most likely, the parameters only contain global pointers, integers, and floats.
Also, if you have too many arguments, we usually allocate a struct in global memory and pass its pointer to the kernel. That way, any kernel can be reduced to a one-argument kernel, and so on.
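A sketch of that single-argument pattern, with an illustrative struct and names (nothing here is from the thread itself):

```cuda
// Hypothetical example: pack all kernel parameters into a struct in
// global memory and pass only its pointer, so every kernel has exactly
// one argument.
struct Args {
    float *in;
    float *out;
    int    n;
    float  scale;
};

__global__ void process(const Args *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread dereferences the same pointer; the fields land in
    // registers after the first read.
    if (i < a->n)
        a->out[i] = a->in[i] * a->scale;
}

// Host side (error checking omitted):
//   Args h = { d_in, d_out, n, 2.0f };
//   Args *d_args;
//   cudaMalloc(&d_args, sizeof(Args));
//   cudaMemcpy(d_args, &h, sizeof(Args), cudaMemcpyHostToDevice);
//   process<<<grid, block>>>(d_args);
```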
Remember that reading global memory in a coalesced manner means each thread must read its own distinct word (or double/quad word). If all threads must read one address, constant memory with its broadcasting works for this. And it's cached.