PTXAS info ...+16 bytes

Hello everyone.
When I compile my CUDA programs with --ptxas-options=-v, it lists the memory usage of each kernel. For shared memory, it tells me how many bytes I allocated plus 16 extra bytes of smem; for example, "8192+16 bytes smem". Where are these 16 bytes coming from? Even if my kernel has no parameters and doesn't use threadIdx/blockIdx/etc. (the only uses I could come up with for them), the 16 bytes are still allocated. I'm asking because a kernel I have needs exactly 8192 bytes of shared memory per block on compute capability 1.3, and the extra 16 bytes drop my occupancy from 100% to 50%.
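For reference, here is a minimal made-up kernel (not my real code) that shows the same report:

    // minimal_smem.cu -- hypothetical reproduction of the "+16 bytes" report
    __device__ float result[2048];

    __global__ void kernel(void)
    {
        __shared__ float buf[2048];              // 2048 * 4 = 8192 bytes of shared memory
        buf[threadIdx.x] = (float)threadIdx.x;   // one element per thread
        __syncthreads();
        result[threadIdx.x] = buf[threadIdx.x];
    }

Compiled with

    nvcc -arch=sm_13 --ptxas-options=-v -c minimal_smem.cu

ptxas reports "8192+16 bytes smem" for this kernel.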

If anyone could shed some light on the issue, I would appreciate it. Thanks.

The 16 bytes are reserved for those parameters you mention (blockIdx, blockDim, gridDim, warpSize, etc.) whether or not your code accesses them. There is no way to (safely) use all 16 kB of shared memory. That is also exactly the occupancy drop you're seeing: a compute capability 1.x multiprocessor has 16384 bytes of shared memory, and two blocks of 8192+16 = 8208 bytes would need 16416 bytes, so only one block fits per multiprocessor instead of two.

The 16 bytes are indeed threadIdx and blockIdx. They are always allocated even if you don’t use them.

I vaguely remember someone suggesting to overwrite them with an out-of-bounds array access, but I haven't tried it myself. Something along the lines of the sketch below.
Naturally, this is not a recommended practice.
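A rough, untested sketch of what I mean; it assumes the 16 reserved bytes sit immediately below the first __shared__ variable, which is undocumented:

    // NOT recommended: rumored compute capability 1.x trick.
    // Declare 16 bytes less than you need so ptxas reports 8176+16 = 8192,
    // then steal the reserved 16 bytes by indexing below the array.
    __global__ void reclaim16(float *out)
    {
        __shared__ float buf[2044];    // 8176 bytes declared
        int tid = threadIdx.x;         // read before clobbering anything

        __syncthreads();               // nothing may read the special values
                                       // (blockIdx etc.) after this point
        float *full = buf - 4;         // pretend the array starts 16 bytes lower;
                                       // full[0..3] land in the reserved bytes
        full[tid] = (float)tid;
        __syncthreads();
        out[tid] = full[tid];
    }

Whether the reserved bytes really sit at those addresses, and whether the hardware tolerates the out-of-bounds access, is exactly what I haven't verified.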

This can potentially work on compute capability 1.x devices (there was a post where someone did this by negative indexing a shared memory array), but will immediately terminate a kernel on Fermi, thanks to proper bounds checking.

Particularly since the bytes aren’t there in Fermi at all. I didn’t mention it because the original post was concerned with compute capability 1.3. However, we should probably discourage use of this “technique” as much as possible. Maybe I should not write such posts at all.

Thanks all. I figured it was something like that, but if there are threadIdx x/y/z and blockIdx x and y (or is it blockIdx that can have three dimensions?), that's 5 integers at 4 bytes each = 20 bytes total. Or are they packed, since you don't need all 2^32 values for thread/block indexing?

I disagree - it’s always good to know your options and interesting ways to accomplish something. I think that your warnings are warranted though. But don’t hold back on the cool ways to do something. Spread creativity. You never know when you may need it, or draw on a solution to come up with an innovative way to accomplish something else.

The integers are saved as shorts, and the remaining 6 bytes are filled with a few other things I’m too lazy to look up.

I forgot to mention that 50% occupancy is not bad at all; it's definitely not worth using such tricks just to get to 100%.
As an exercise, you might try overwriting the vars to get to 100%, and (if it does not crash and gives correct results) you will probably find that the kernel does not run any faster. Timing it with CUDA events, as sketched below, would show this.
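A host-side timing fragment; the kernel name, grid, block, and d_out are placeholders:

    // Time one kernel launch with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    kernel<<<grid, block>>>(d_out);    // the launch being measured
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);        // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

Run it once with the stock kernel at 50% occupancy and once with the trick applied, and compare.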

threadIdx doesn't go into shared memory because it has a different value for every thread in the block; it's placed in a register when the thread starts. blockDim, however, is the same for every thread and has three dimensions, and gridDim and blockIdx are likewise uniform, with two dimensions each. The layout was given in another forum post.
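If you want to see for yourself on a compute capability 1.x card, an untested sketch like this (it assumes the reserved 16 bytes sit directly below the first __shared__ variable) would dump them as shorts:

    // Curiosity only: dump the 16 reserved bytes as 8 shorts (CC 1.x).
    // Out-of-bounds shared access; will not survive on Fermi and later.
    __global__ void dumpReserved(unsigned short *out)
    {
        __shared__ unsigned short probe[8];    // 16 bytes of our own
        if (threadIdx.x == 0)
            for (int i = 0; i < 8; ++i)
                out[i] = probe[i - 8];         // reach back into the reserved block
    }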

Sure, but we just want to avoid surprising people as well, especially if they are not sure what is supported. This trick requires that you not access these parameters (or save them to registers before blowing them away) and will instantly segfault on newer GPUs. Caveat emptor. :)
