Hi All,
I was running some tests today to try to understand the ptxas output from my compiler and some questions came up that I have not manage to find an answer to. Im running with CUDA 5.5 toolkit
The following test kernels where compiled under sm_13:
extern "C"
__global__ void registerTestKernel_UINT(unsigned int *out)
{
int tid = blockDim.x*blockIdx.x + threadIdx.x;
int val;
val = 0;
val += tid;
val += tid*2;
out[tid] = val;
}
ptxas output :
ptxas : info : Compiling entry function ‘registerTestKernel_UINT’ for ‘sm_13’
ptxas : info : Used 4 registers, 20 bytes smem, 8 bytes cmem[1]
extern "C"
__global__ void registerTestKernel_UINT_ARRAY(unsigned int *out)
{
int tid = blockDim.x*blockIdx.x + threadIdx.x;
int val[1];
val[0] = 0;
val[0] += tid;
val[0] += tid*2;
out[tid] = val[0];
}
ptxas output :
ptxas : info : Compiling entry function ‘registerTestKernel_UINT_ARRAY’ for ‘sm_13’
ptxas : info : Used 4 registers, 20 bytes smem, 8 bytes cmem[1], 4 bytes lmem
extern "C"
__global__ void registerTestKernel_UINT_1(uint1 *out)
{
int tid = blockDim.x*blockIdx.x + threadIdx.x;
uint1 val;
val.x = 0;
val.x += tid;
val.x += tid*2;
out[tid] = val;
}
ptaxs output :
ptxas : info : Compiling entry function ‘registerTestKernel_UINT_1’ for ‘sm_13’
ptxas : info : Used 4 registers, 20 bytes smem, 8 bytes cmem[1], 4 bytes lmem
As you can see I get lmem 4 bytes for ‘registerTestKernel_UINT_1’ and ‘registerTestKernel_UINT_ARRAY’.
In the array case : I know that registers can not be indexed and the compiler will put it in local memory space if it can not sort out the indexing but I thought that the indexing should be quite straight forward to figure out for the compiler in my case?
Using uint1 : Here I thought that the compiler should be able to treat this as an normal unsigned int but apperently not. Does anyone know why? All the build in types (int1, int2 etc ) seems to be put in local memory space.
Another question :
If I compile it for sm_3 I get the following ptxas
ptxas : info : Compiling entry function ‘registerTestKernel_UINT_ARRAY’ for ‘sm_30’
1> ptxas : info : Function properties for registerTestKernel_UINT_ARRAY
1> 8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas : info : Used 6 registers, 8 bytes cumulative stack size, 324 bytes cmem[0]
1> ptxas : info : Compiling entry function ‘registerTestKernel_UINT_1’ for ‘sm_30’
1> ptxas : info : Function properties for registerTestKernel_UINT_1
1> 8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas : info : Used 6 registers, 8 bytes cumulative stack size, 324 bytes cmem[0]
1> ptxas : info : Compiling entry function ‘registerTestKernel_UINT’ for ‘sm_30’
1> ptxas : info : Function properties for registerTestKernel_UINT
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas : info : Used 5 registers, 324 bytes cmem[0]
The difference here is that I use one more register and cumulative stack size gets 8 bytes. What is the cumulative stack size and why does it increase with 8 bytes. From the ptx files I can still see that the variable int val[1] in the array test kernel gets declared as a local variable and converting the local state pointer to generic addresses
.local .align 4 .b8 __local_depot1[4];
.reg .b32 %SP;
.reg .b32 %SPL;
.reg .s32 %r<16>;
.loc 1 127 1
func_begin1:
.loc 1 127 0
.loc 1 127 1
mov.u32 %SPL, __local_depot1;
cvta.local.u32 %SP, %SPL;
Hope someone can give me some insight =)