Compiler ruining my kernel by using shared memory


I have a carefully hand-optimized kernel. The problem is that the OpenCL JIT compiler is using 68+16 bytes of smem (as reported by -nv-verbose).
The kernel uses 128 threads per block and 16 registers per thread (100% occupancy on a GT200).

However, if each thread uses 68 bytes of OpenCL local memory (CUDA shared memory), that comes to 68 × 128 = 8704 bytes per block, resulting in a poor 13% occupancy.
I want to know why the JIT compiler is allocating so much smem when I don’t even use it! Is there any way to tell the compiler not to optimize the kernel with shared memory automatically, please?

Btw, I really think you should let us use the C “register” keyword too (so we can do the opposite: force the compiler to use registers and not local/global memory).


I believe the usage is reported per block, not per thread.

Ok so if -nv-verbose reports this

68+16 smem

it really means the kernel will use 68 bytes of shared memory for the complete block, and not for each thread in the block?

And what is the 16, please?

At least that’s how it worked in CUDA’s cubins. All statically allocated shared memory is per-block and that’s what gets reported. Dynamically allocated shared memory isn’t reported at all but it’s also per-block.

You get more than 0 bytes even without allocating anything because kernel arguments are implicitly passed through shared memory (an NVIDIA implementation detail; AMD uses constant memory for that).
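
To put numbers on the per-thread vs. per-block reading, here is a rough sketch of the occupancy arithmetic in C. It is not NVIDIA's occupancy calculator: the GT200 limits (16 KB shared memory, 16384 registers, 1024 resident threads and 8 resident blocks per multiprocessor) and the 512-byte shared memory allocation granularity are assumptions written in as constants, and the function names are made up for illustration.

```c
#include <assert.h>

/* Assumed per-multiprocessor limits for a GT200-class GPU. */
enum {
    SM_SHARED_BYTES  = 16384, /* shared memory, bytes */
    SM_REGISTERS     = 16384, /* 32-bit registers */
    SM_MAX_THREADS   = 1024,  /* resident threads */
    SM_MAX_BLOCKS    = 8,     /* resident blocks */
    SMEM_GRANULARITY = 512    /* assumed shared memory allocation unit, bytes */
};

static int min_int(int a, int b) { return a < b ? a : b; }

/* Hypothetical helper: resident blocks per multiprocessor, given the
   per-block thread count, registers per thread and shared memory per block. */
int blocks_per_sm(int threads_per_block, int regs_per_thread, int smem_per_block)
{
    assert(threads_per_block > 0 && regs_per_thread > 0 && smem_per_block >= 0);

    /* Round the shared memory request up to the allocation granularity. */
    int smem = (smem_per_block + SMEM_GRANULARITY - 1)
               / SMEM_GRANULARITY * SMEM_GRANULARITY;

    /* Start from the block limit, then apply each resource limit in turn. */
    int blocks = SM_MAX_BLOCKS;
    blocks = min_int(blocks, SM_MAX_THREADS / threads_per_block);
    blocks = min_int(blocks, SM_REGISTERS / (regs_per_thread * threads_per_block));
    if (smem > 0)
        blocks = min_int(blocks, SM_SHARED_BYTES / smem);
    return blocks;
}

/* Occupancy in percent: resident threads over the maximum resident threads. */
double occupancy_percent(int threads_per_block, int regs_per_thread, int smem_per_block)
{
    int blocks = blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block);
    return 100.0 * blocks * threads_per_block / SM_MAX_THREADS;
}
```

Under these assumptions, blocks_per_sm(128, 16, 68 * 128) gives 1 resident block (12.5% occupancy, the "13%" from the first post), while blocks_per_sm(128, 16, 68 + 16) gives 8 resident blocks (100% occupancy). So if the reported figure is per block, shared memory is not what limits this kernel.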