shared memory & register usage

Dear experts

I compiled the following code using CUDA 1.0.
As far as I can see, this code does not use shared memory.
I also expected it to use 20 bytes of registers, because the code uses the variables x, y, tmp, row, col.
However, in the cubin file, smem was 28 and reg was 8.
I could not understand why.
Please explain the reason to me.

Thank you.


__global__ void mul(float *a, float *b, float *c)
{
    int x = threadIdx.x, y = threadIdx.y;
    float t = 0.0;
    for (int i = 0; i < N; i++)
    {
        int m = i + y * N;
        int n = x + i * N;
        t = a[m] * b[n];
    }

    c[x + y * N] = t;
}

code {
    name = mul;
    lmem = 0
    smem = 28
    reg = 8
    bar = 0
    bincode {

    }
    const {
        segname = const
        segnum = 1
        offset = 0
        bytes = 4
        mem {
            0x000003ff
        }
    }
}

The internal parameters of a kernel launch need 16 bytes of shared memory. The arguments to your kernel are also passed via shared memory, and you have three arguments of 4 bytes each. Thus 16 + (3 * 4) = 28 bytes.
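To make the arithmetic concrete, here is a minimal host-side sketch in plain C++. It only reproduces the sum from the reply above; it does not query the compiler or the device. The 16-byte launch overhead is taken from that reply, the 4-byte pointer size is an assumption implied by its 3 * 4 term (on a host where pointers are 8 bytes, sizeof(float *) would give a different result), and the names kLaunchOverheadBytes, kDevicePointerBytes, and expected_smem_bytes are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Taken from the reply above: the launch itself reserves 16 bytes of
// shared memory for its internal parameters.
constexpr std::size_t kLaunchOverheadBytes = 16;

// Assumption: each kernel argument is a 4-byte device pointer, as the
// 3 * 4 term in the reply implies. We deliberately do not use
// sizeof(float *), which is 8 on a 64-bit host.
constexpr std::size_t kDevicePointerBytes = sizeof(std::uint32_t);

// Hypothetical helper: expected smem for a kernel that declares no
// __shared__ variables and takes only pointer arguments.
constexpr std::size_t expected_smem_bytes(std::size_t pointer_args) {
    return kLaunchOverheadBytes + pointer_args * kDevicePointerBytes;
}

// mul takes three arguments (a, b, c): 16 + 3 * 4 = 28.
static_assert(expected_smem_bytes(3) == 28,
              "should match smem = 28 in the cubin listing");
```

Under these assumptions, a kernel with a fourth 4-byte argument would be expected to report smem = 32.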

The number of registers results from how the PTX assembler compiles and optimizes the code, which is not straightforward to understand. See the various other posts on this forum if you want to learn more. For your first experiments, however, don't worry about it.

Peter

Dear Peter

Thank you very much for your good answer.
I now understand the reason for the smem usage.
I also understand the reason for the reg usage, although, as you say, it is not straightforward to understand :-).

Thank you very much.