Strange Compiler Shared Memory Usage

Why does compiling the following code snippet (with nvcc 2.3 on 64-bit Linux) produce 16 bytes of static shared memory usage?

__global__ void f(void) {
    extern __shared__ float a[];
}

In general, I’m seeing the compiler produce 16 bytes of static shared memory usage when I have a dynamically sized shared memory array, as in this example. I’d really like my 16 bytes back: for example, I have a wavelet transform that could fit twice as many thread blocks per SM if these 16 bytes were not taken.
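(For anyone who wants to reproduce the number: it comes from the verbose ptxas report, e.g. something like the following, where the file name and -arch flag are just placeholders:)

nvcc -arch=sm_13 --ptxas-options=-v -c f.cu

The per-entry ptxas output lists register and shared memory usage, and the static smem figure for f() is where the extra 16 bytes show up.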

If I am not mistaken, 16 bytes of shared memory are always reserved for some internal bookkeeping. For example, I think gridDim, blockDim, and blockIdx are stored there.

That doesn’t seem to be it either, because simple dummy kernels that use all three of gridDim, blockDim, and blockIdx do not use any static shared memory … until you add an extern __shared__ array.
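A minimal sketch of that comparison, compiled with the same verbose ptxas flags as above (kernel and variable names are made up for illustration):

__device__ unsigned int sink;

// Uses all three built-ins but no shared memory: no static smem is reported.
__global__ void uses_builtins(void) {
    sink = gridDim.x + blockDim.x + blockIdx.x;
}

// The same kernel plus an unsized extern shared array: the reported static
// smem now includes the extra 16 bytes, even though the array is never touched.
__global__ void uses_builtins_and_smem(void) {
    extern __shared__ float buf[];
    sink = gridDim.x + blockDim.x + blockIdx.x;
}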

Also, as exposed in CUDA, gridDim, blockIdx, and blockDim are each triples of (at most) 32-bit integers. In all, that is 9 * 4 = 36 bytes. Even if every component were stored as a 16-bit value, that would still be 18 bytes, and packing blockDim.z into 8 bits only brings it down to 17 bytes. None of these fit into 16 bytes.

Wait, I don’t follow the math:

max block dim = 512 x 512 x 64 => 2 + 2 + 1 = 5 bytes of storage

max block index = 65535 x 65535 => 2 + 2 = 4 bytes of storage

max grid dim = 65535 x 65535 => 2 + 2 = 4 bytes of storage

That gives 13 bytes, leaving room for some extra stuff like warpSize, which I think is also stored there.
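Just to make that byte count concrete, here is a purely illustrative packing that adds up to those 13 bytes. It is only the arithmetic above written out as a struct, not a claim about the actual hardware layout:

#include <stdint.h>

#pragma pack(push, 1)
struct launch_params_guess {          /* hypothetical name, for illustration only */
    uint16_t blockDim_x, blockDim_y;  /* up to 512 each   -> 2 + 2 bytes */
    uint8_t  blockDim_z;              /* up to 64         -> 1 byte      */
    uint16_t blockIdx_x, blockIdx_y;  /* up to 65535 each -> 2 + 2 bytes */
    uint16_t gridDim_x,  gridDim_y;   /* up to 65535 each -> 2 + 2 bytes */
};  /* sizeof == 13, leaving 3 of the 16 bytes for extras such as warpSize */
#pragma pack(pop)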

Someone did some digging and figured out the memory layout of these initial 16 bytes of shared memory, but now I can’t find the post. They discovered the contents by creating a shared memory array and then indexing it with negative values.
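The trick presumably looked something like the sketch below. The names here are made up, indexing below the start of the array is well outside anything the programming guide guarantees, and whether it still reads the preamble depends on the toolchain and hardware generation:

__device__ unsigned int preamble_dump[4];

__global__ void probe_preamble(void) {
    extern __shared__ unsigned int s[];  // the user array starts just after the preamble
    if (threadIdx.x == 0) {
        for (int i = 0; i < 4; ++i) {
            preamble_dump[i] = s[i - 4]; // s[-4]..s[-1]: the 16 bytes below the array
        }
    }
}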

Consider the following code:

__device__ int variable;

__global__ void empty() {
    variable = gridDim.x + blockDim.x + blockIdx.x;
}
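(For reference, the PTX and the cubin that decuda consumes below can be produced roughly like this; the -arch flag and file names are assumptions, and the exact decuda invocation depends on how it is installed:)

nvcc -arch=sm_13 -ptx empty.cu      # writes empty.ptx
nvcc -arch=sm_13 -cubin empty.cu    # writes empty.cubin
decuda empty.cubin                  # third-party disassembler for pre-Fermi cubins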

If you compile it to PTX, you get:

.entry _Z5emptyv
	{
	.reg .u32 %r<7>;
	cvt.u32.u16 	%r1, %ctaid.x;  // ctaid = index of this cta (a cta is a block)
	cvt.u32.u16 	%r2, %ntid.x;   // ntid = number of threads per block
	cvt.u32.u16 	%r3, %nctaid.x; // nctaid = number of blocks in the grid
	add.u32 	%r4, %r2, %r3;
	add.u32 	%r5, %r1, %r4;
	st.global.s32 	[variable], %r5;
	exit;
	}

This won’t tell us much, since PTX is meant to be machine-independent and says nothing about where these counters are actually stored. However, if you take the cubin file and run it through decuda, you get:

.entry _Z5emptyv
{
.lmem 0
.smem 0
.reg 2
.bar 0
cvt.u32.u16 $r0, %ntid.y
cvt.u32.u16 $r1, s[0x0008]
add.u32 $r1, $r0, $r1
cvt.u32.u16 $r0, s[0x000c]
add.u32 $r0, $r0, $r1
mov.b32 $r1, c14[0x0000]
mov.end.u32 g[$r1], $r0
}

Where:

cvt.u32.u16 X, Y converts the unsigned 16-bit value Y into the unsigned 32-bit value X.

s[addr] is a shared memory access at address addr.

c14[addr] is a constant memory access in bank 14 at address addr.

g[addr] is, as you probably guessed, a global memory access at address addr.

So you can see that although we never allocated any variables in shared memory, the kernel still reads two of its launch parameters from shared memory (the loads from s[0x0008] and s[0x000c] above), so the memory is actually in use. Once you start using shared memory explicitly, those 16 bytes show up in the ptxas output as the “+ 16” in “something + 16”.