When I compile my CUDA programs with --ptxas-options=-v, the compiler lists the memory usage of each kernel. For shared memory, it reports the number of bytes I allocated plus an extra 16 bytes, for example "8192+16 bytes smem". Where do these 16 bytes come from? The only uses I could come up with are kernel parameters and built-in variables such as threadIdx and blockIdx, but the 16 bytes are allocated even when the kernel takes no parameters and uses none of those variables. I'm asking because one of my kernels needs exactly 8192 bytes of shared memory per block on compute capability 1.3. With 16 KB of shared memory per multiprocessor, the extra 16 bytes mean only one block fits instead of two, which drops my occupancy from 100% to 50%.
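To spell out the arithmetic behind that drop, here is a rough sketch in plain host-side C++. It assumes the compute capability 1.3 limits of 16 KB of shared memory and 1024 resident threads per multiprocessor, and a block size of 512 threads. The block size is an assumption that is consistent with the 100% and 50% figures above, and the function and constant names are made up for illustration.

```cpp
#include <cassert>

// Assumed per-multiprocessor limits for compute capability 1.3.
const int kSmemPerSM = 16384;       // 16 KB of shared memory
const int kMaxThreadsPerSM = 1024;  // maximum resident threads

// Hypothetical block size; an assumption, chosen to match the figures above.
const int kThreadsPerBlock = 512;

// Resident blocks per multiprocessor when shared memory is the only limiter.
int blocks_by_smem(int smem_per_block) {
    assert(smem_per_block > 0);
    return kSmemPerSM / smem_per_block;  // integer division: partial blocks do not fit
}

// Occupancy in percent: resident threads over the maximum resident threads.
int occupancy_percent(int smem_per_block) {
    int threads = blocks_by_smem(smem_per_block) * kThreadsPerBlock;
    if (threads > kMaxThreadsPerSM) {
        threads = kMaxThreadsPerSM;  // the thread limit caps occupancy at 100%
    }
    return 100 * threads / kMaxThreadsPerSM;
}
```

Under these assumptions, occupancy_percent(8192) gives 100 because two blocks fit, while occupancy_percent(8192 + 16) gives 50 because 8208 bytes only fit once into 16384.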
If anyone could shed some light on this, I would appreciate it. Thanks.