I’m developing an application that would ideally use all 16 KB of shared memory for data, but since CUDA passes __global__ function arguments through shared memory, the full 16 KB isn’t available.
Is there any way to recover the shared memory used for argument passing after copying the arguments to thread registers?
Is there a performance penalty in writing the values to constant memory and accessing them vs. CUDA putting the parameter values in shared memory before the kernel starts? If so, what would the penalty be?
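You can’t use the full 16 KB of shared memory even if the kernel takes no parameters. Some of it is reserved for blockIdx, threadIdx and similar built-in values.
There shouldn’t be much of a penalty for passing values through constant memory, unless the threads of a warp read different addresses (for example, divergently indexed look-up tables), in which case the reads are serialized.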
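As a rough sketch of the constant-memory route: the scalar parameters go into a __constant__ struct that the host fills with cudaMemcpyToSymbol before the launch, so only the data pointer remains a kernel argument. The names KernelParams, d_params, scale_kernel and run_scale_example are made up for illustration. Every thread reads the same addresses in d_params, which is the access pattern the constant cache handles well.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical parameter block; the field names are illustrative only.
struct KernelParams {
    float scale;
    float offset;
    int   n;
};

// Lives in constant memory instead of the kernel argument area.
__constant__ KernelParams d_params;

// Only the data pointer is still passed as a regular kernel argument.
__global__ void scale_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < d_params.n) {
        // All threads read the same constant addresses (uniform access).
        data[i] = data[i] * d_params.scale + d_params.offset;
    }
}

// Host helper: applies values[i] = values[i] * scale + offset on the GPU.
// Returns true if every CUDA call succeeded.
bool run_scale_example(std::vector<float> &values, float scale, float offset)
{
    if (values.empty()) return true;

    KernelParams h_params = { scale, offset, static_cast<int>(values.size()) };
    size_t bytes = values.size() * sizeof(float);
    float *d_data = nullptr;

    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;

    // Copy the parameters to constant memory, then the data to global memory.
    bool ok = cudaMemcpyToSymbol(d_params, &h_params, sizeof(h_params)) == cudaSuccess
           && cudaMemcpy(d_data, values.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int threads = 128;
        int blocks = (h_params.n + threads - 1) / threads;
        scale_kernel<<<blocks, threads>>>(d_data);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(values.data(), d_data, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_data);
    return ok;
}
```

Calling run_scale_example(v, 2.0f, 1.0f) on a vector v replaces each element x with 2x + 1. The cost is one extra cudaMemcpyToSymbol per parameter change, which has to be issued before the launch that depends on it.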
CUDA adds a 16-byte shared memory overhead to all kernels. But I’m not sure what it contains. My guess is that threadIdx, blockIdx and other such variables are stored in registers, not shared memory.
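One way to check the overhead on a given toolkit and device, without disassembling the cubin, is to ask the runtime how much static shared memory a kernel uses. This is only a sketch: one_param_kernel and query_static_shared_bytes are hypothetical names, and the number reported depends on the architecture (newer GPUs pass kernel arguments through constant memory, so the figure there need not include them).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with a single pointer parameter and no
// explicitly declared shared memory.
__global__ void one_param_kernel(float *out)
{
    if (out != nullptr) {
        out[threadIdx.x] = 0.0f;
    }
}

// Queries the static shared memory the runtime reports for the kernel.
// Returns true on success and stores the size in *bytes.
bool query_static_shared_bytes(size_t *bytes)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, one_param_kernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return false;
    }
    *bytes = attr.sharedSizeBytes;
    printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);
    return true;
}
```

The query tells you the size only, not what the bytes contain.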
In the disassembled cubin, it appears that those values (except for threadIdx, which is passed in register r0) live in a separately addressed, read-only piece of shared memory. It is completely separate from the read-write memory we call ‘shared memory’.
Parameters start at offset 0x10 (4*0x4) of the writable shared memory area. Other declared shared memory variables come immediately after that. As to what is at offset 0x00, I don’t know. That is indeed the 16 bytes of overhead. I should try reading them out some time and see if they contain the same values as the %blockIdx registers.
It seems I was completely wrong about my separate registers. The block parameters are in shared memory just like anything else, and are therefore writable too:
So… yes, the first 16 (8*2) bytes of shared memory are the block parameters, and you can never get rid of that overhead. You can overwrite them though if you feel really lucky :P (by indexing into negative shared memory)
Parameters 0x9, 0xA, 0xB, etc. don’t exist; those slots would hold the normal (user) parameters.
I played around with this some and I don’t think these are missing parameters. I think that the compiler puts the dynamic shared memory after the kernel parameters and then aligns it to the next 16-byte boundary. Since you have one parameter, it skips three words and then starts the dynamic shared segment.
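The arithmetic behind that guess can be sketched as follows. It assumes the layout described in this thread (16 bytes of block parameters, then the kernel parameters, then padding up to the next 16-byte boundary) and a kernel with no statically declared shared variables. The helper names are hypothetical, and this is host-side code only.

```cuda
#include <cstddef>

// Assumptions taken from the thread, not from any documented guarantee:
// the first 16 bytes of shared memory hold the block parameters, and the
// dynamic shared segment is aligned to a 16-byte boundary.
constexpr std::size_t kBlockParamBytes = 16;
constexpr std::size_t kAlignment       = 16;

// Hypothetical helper: offset at which the dynamic shared segment would
// start for a kernel whose arguments occupy param_bytes in total.
constexpr std::size_t dynamic_shared_offset(std::size_t param_bytes)
{
    return (kBlockParamBytes + param_bytes + kAlignment - 1)
           / kAlignment * kAlignment;
}

// Bytes skipped between the last parameter and the dynamic segment.
constexpr std::size_t padding_bytes(std::size_t param_bytes)
{
    return dynamic_shared_offset(param_bytes)
           - (kBlockParamBytes + param_bytes);
}

// One 4-byte parameter: it occupies 0x10..0x13, so the dynamic segment
// starts at 0x20 and three 4-byte words (12 bytes) are skipped.
static_assert(dynamic_shared_offset(4) == 0x20, "segment starts at 0x20");
static_assert(padding_bytes(4) == 12, "three words of padding");
```

Under the same assumptions, four 4-byte parameters would fill the 0x10..0x1F range exactly and leave no padding at all.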