Consider the following code:
__device__ int variable;
__global__ void empty() {
variable=gridDim.x+blockDim.x+blockIdx.x;
}
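To check what the kernel actually stores, a minimal host-side harness could look like the sketch below. The helper name `launch_and_read` and the single-block, single-thread launch configuration are assumptions for illustration and are not part of the original code. One thread is used so that only one write to `variable` happens.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int variable;

__global__ void empty() {
    variable = gridDim.x + blockDim.x + blockIdx.x;
}

// Hypothetical helper (not in the original post): launches the kernel with
// one block of one thread and returns the value stored in `variable`,
// or -1 if any CUDA call fails.
int launch_and_read() {
    empty<<<1, 1>>>();
    if (cudaGetLastError() != cudaSuccess) return -1;
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    int host_value = 0;
    if (cudaMemcpyFromSymbol(&host_value, variable, sizeof(host_value)) != cudaSuccess)
        return -1;
    return host_value;
}

int main() {
    // gridDim.x = 1, blockDim.x = 1, blockIdx.x = 0, so the sum is 2.
    printf("variable = %d\n", launch_and_read());
    return 0;
}
```

With more than one block, every block writes a different value to `variable`, so the result would depend on which block finishes last.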
If you compile the kernel to PTX, you get:
.entry _Z5emptyv
{
.reg .u32 %r<7>;
cvt.u32.u16 %r1, %ctaid.x; //cta (cooperative thread array) is a block
cvt.u32.u16 %r2, %ntid.x; //ntid == number of threads per block
cvt.u32.u16 %r3, %nctaid.x; //nctaid == number of blocks in the grid
add.u32 %r4, %r2, %r3;
add.u32 %r5, %r1, %r4;
st.global.s32 [variable], %r5;
exit;
}
This does not tell us much: PTX is meant to be machine independent, so it reveals nothing about how these counters are stored. However, if you take the cubin file and feed it to decuda, you get:
.entry _Z5emptyv
{
.lmem 0
.smem 0
.reg 2
.bar 0
cvt.u32.u16 $r0, %ntid.y
cvt.u32.u16 $r1, s[0x0008]
add.u32 $r1, $r0, $r1
cvt.u32.u16 $r0, s[0x000c]
add.u32 $r0, $r0, $r1
mov.b32 $r1, c14[0x0000]
mov.end.u32 g[$r1], $r0
}
Where: cvt.u32.u16 X, Y converts the unsigned 16-bit value Y into the unsigned 32-bit value X.
s[addr] is a shared memory access at address addr.
c14[addr] is a constant memory access in bank 14 at address addr.
g[addr] is, as you probably guessed, a global memory access at address addr.
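So you see that although we never allocated any variables in shared memory, shared memory is actually used. Matching the operands against the PTX, s[0x0008] takes the place of %nctaid.x (gridDim.x) and s[0x000c] that of %ctaid.x (blockIdx.x). Once you start using shared memory explicitly, these implicit 16 bytes appear in the ptxas output as the “+ 16” in “something + 16”.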
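To see that extra term yourself, you could compile a kernel that does allocate shared memory explicitly. The sketch below is only an illustration: the kernel `reverse_tile`, the helper `run_reverse_tile`, and the tile size of 64 are all made up for this example. Building it with `nvcc --ptxas-options=-v` for the hardware generation discussed here should report the explicitly requested bytes with the 16 implicit bytes listed separately. The exact figures are not claimed here, since they depend on the toolkit version and target architecture.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 64  // assumed tile size; 64 ints = 256 bytes of explicit shared memory

// Hypothetical kernel: reverses one tile of TILE ints through shared memory.
__global__ void reverse_tile(int *data) {
    __shared__ int tile[TILE];
    unsigned int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();               // every element must be loaded before reading
    data[t] = tile[TILE - 1 - t];
}

// Hypothetical helper: returns 0 if the tile was reversed correctly, 1 otherwise.
int run_reverse_tile() {
    int host[TILE];
    for (int i = 0; i < TILE; ++i) host[i] = i;

    int *dev = NULL;
    if (cudaMalloc((void **)&dev, sizeof(host)) != cudaSuccess) return 1;

    bool ok = cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverse_tile<<<1, TILE>>>(dev);
        // The device-to-host copy waits for the kernel on the default stream.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    if (!ok) return 1;

    for (int i = 0; i < TILE; ++i)
        if (host[i] != TILE - 1 - i) return 1;
    return 0;
}

int main() {
    printf("reverse_tile %s\n", run_reverse_tile() == 0 ? "ok" : "failed");
    return 0;
}
```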