LMEM reported by compiler question

Hi All,

I was running some tests today to try to understand the ptxas output from my compiler and some questions came up that I have not manage to find an answer to. Im running with CUDA 5.5 toolkit

The following test kernels where compiled under sm_13:

extern "C"
__global__ void registerTestKernel_UINT(unsigned int *out)
{
  int tid = blockDim.x*blockIdx.x + threadIdx.x;
  int val;
  val = 0;
  val += tid;
  val += tid*2;
  out[tid] = val; 
}

ptxas output :

ptxas : info : Compiling entry function ‘registerTestKernel_UINT’ for ‘sm_13’
ptxas : info : Used 4 registers, 20 bytes smem, 8 bytes cmem[1]

extern "C"
__global__ void registerTestKernel_UINT_ARRAY(unsigned int *out)
{
  int tid = blockDim.x*blockIdx.x + threadIdx.x;
  int val[1];
  val[0] = 0;
  val[0] += tid;
  val[0] += tid*2;
  out[tid] = val[0]; 
}

ptxas output :

ptxas : info : Compiling entry function ‘registerTestKernel_UINT_ARRAY’ for ‘sm_13’
ptxas : info : Used 4 registers, 20 bytes smem, 8 bytes cmem[1], 4 bytes lmem

extern "C"
  __global__ void registerTestKernel_UINT_1(uint1 *out)
{
  int tid = blockDim.x*blockIdx.x + threadIdx.x;
  uint1 val;
  val.x = 0;
  val.x += tid;
  val.x += tid*2;
  out[tid] = val; 
}

ptaxs output :
ptxas : info : Compiling entry function ‘registerTestKernel_UINT_1’ for ‘sm_13’
ptxas : info : Used 4 registers, 20 bytes smem, 8 bytes cmem[1], 4 bytes lmem

As you can see I get lmem 4 bytes for ‘registerTestKernel_UINT_1’ and ‘registerTestKernel_UINT_ARRAY’.

In the array case : I know that registers can not be indexed and the compiler will put it in local memory space if it can not sort out the indexing but I thought that the indexing should be quite straight forward to figure out for the compiler in my case?

Using uint1 : Here I thought that the compiler should be able to treat this as an normal unsigned int but apperently not. Does anyone know why? All the build in types (int1, int2 etc ) seems to be put in local memory space.

Another question :

If I compile it for sm_3 I get the following ptxas

ptxas : info : Compiling entry function ‘registerTestKernel_UINT_ARRAY’ for ‘sm_30’
1> ptxas : info : Function properties for registerTestKernel_UINT_ARRAY
1> 8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas : info : Used 6 registers, 8 bytes cumulative stack size, 324 bytes cmem[0]

1> ptxas : info : Compiling entry function ‘registerTestKernel_UINT_1’ for ‘sm_30’
1> ptxas : info : Function properties for registerTestKernel_UINT_1
1> 8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas : info : Used 6 registers, 8 bytes cumulative stack size, 324 bytes cmem[0]

1> ptxas : info : Compiling entry function ‘registerTestKernel_UINT’ for ‘sm_30’
1> ptxas : info : Function properties for registerTestKernel_UINT
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas : info : Used 5 registers, 324 bytes cmem[0]

The difference here is that I use one more register and cumulative stack size gets 8 bytes. What is the cumulative stack size and why does it increase with 8 bytes. From the ptx files I can still see that the variable int val[1] in the array test kernel gets declared as a local variable and converting the local state pointer to generic addresses

.local .align 4 .b8 	__local_depot1[4];
	.reg .b32 	%SP;
	.reg .b32 	%SPL;
	.reg .s32 	%r<16>;


	.loc 1 127 1
func_begin1:
	.loc	1 127 0

	.loc 1 127 1

	mov.u32 	%SPL, __local_depot1;
	cvta.local.u32 	%SP, %SPL;

Hope someone can give me some insight =)