I have a carefully hand-optimized kernel. The problem is that the OpenCL JIT compiler is using 68+16 bytes of smem (as reported by -cl-nv-verbose).
The kernel runs with 128 threads per block and 16 registers per thread, which should give 100% occupancy on a GT200.
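However, 68 bytes of OpenCL local memory (CUDA shared memory) per thread works out to 68 x 128 = 8704 bytes per block. With 16 KB of shared memory per multiprocessor, only one block fits at a time, resulting in a poor 13% occupancy.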
I want to know why the JIT compiler is allocating so much smem when I don’t even use it! Is there any way to tell the compiler not to use shared memory automatically when it optimizes the kernel?
By the way, I really think you should let us use the C “register” keyword too (so we can do the opposite: force the compiler to use registers rather than local/global memory).