How to understand lmem, smem, reg?

Hi, all,

I wrote a small kernel to check how many registers, how much local memory, and how much smem it uses.

As you can see:
__global__ void vectorAddition(float* A, float* B, float* result)
{
    int i = threadIdx.x;
}

Compiling with --ptxas-options=-v returns:

1>ptxas info : Compiling entry function '_Z14vectorAdditionPfS_S_' for 'sm_20'
1>ptxas info : Used 2 registers, 56 bytes cmem[0]

1>ptxas info : Compiling entry function '_Z14vectorAdditionPfS_S_' for 'sm_10'
1>ptxas info : Used 0 registers, 24+16 bytes smem

Why are the resource usage numbers different between the two GPU architectures?

Moreover, are there any documents or links explaining how the number/size of registers, lmem, cmem and smem is determined?

Thx~

Kernel arguments are stored in smem for compute capability 1.x and in cmem for 2.x.
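
If it helps, you can also read those counts back at runtime with cudaFuncGetAttributes() from the runtime API. A minimal sketch follows; the interpretation in the comments is my guess: the 24 bytes of smem on sm_10 would be the three 8-byte pointer arguments of a 64-bit build, and the extra 16 bytes implicitly passed launch information.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void vectorAddition(float* A, float* B, float* result)
{
    int i = threadIdx.x;
    result[i] = A[i] + B[i];
}

int main()
{
    // Query the same per-kernel resource counts that ptxas prints at compile time.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, vectorAddition);

    printf("registers per thread: %d\n", attr.numRegs);
    printf("smem per block      : %d\n", (int)attr.sharedSizeBytes); // on 1.x this should include the argument bytes
    printf("cmem for the kernel : %d\n", (int)attr.constSizeBytes);  // on 2.x the arguments land here instead
    printf("lmem per thread     : %d\n", (int)attr.localSizeBytes);
    return 0;
}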

As your kernel does not do anything, it is easily optimized down to zero registers. Why the compiler is not able to optimize the kernel away entirely on sm_20, I don't know. But it's probably more interesting to investigate a kernel that actually does something.

Hi, tera. So, what are the pros and cons of storing the kernel parameters in smem versus cmem?

Hi, I changed the kernel so that it looks more "meaningful":

__global__ void vectorAddition(float* A, float* B, float* result)
{
    int i = threadIdx.x;
    result[i] = A[i] + B[i];
}

Now the result is:

1>ptxas info : Compiling entry function '_Z14vectorAdditionPfS_S_' for 'sm_20'
1>ptxas info : Used 12 registers, 56 bytes cmem[0]

1>ptxas info : Compiling entry function '_Z14vectorAdditionPfS_S_' for 'sm_10'
1>ptxas info : Used 4 registers, 24+16 bytes smem

Could someone tell me why the compiler reports those numbers? For example, why does the kernel need 4 registers under sm_10?

One con of parameters in smem is that sometimes an algorithm requires a power-of-two number of bytes in smem. Since 1.x devices store the arguments there, and use another 16 bytes per block that you can't (legally) get rid of, you'll then have to waste half of the shared memory (see the sketch at the end of this post).

Parameters in cmem also allow a longer parameter list, since cmem is not as scarce as smem.

Other than that, I don’t see a lot of difference.
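
To put some assumed numbers on the "waste half the smem" point: a 1.x device has 16 KB of smem per block, and once the 16 system-reserved bytes plus the parameter bytes sit in there, the largest power-of-two buffer that still fits is 8 KB. A host-side sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Assumed overhead for the vector-add kernel above on a 1.x device:
    // 16 system-reserved bytes + 24 bytes of pointer arguments.
    size_t total    = prop.sharedMemPerBlock;  // 16384 on compute 1.x
    size_t overhead = 16 + 24;
    size_t usable   = total - overhead;        // 16344

    // Largest power-of-two buffer that still fits:
    size_t pow2 = 1;
    while (pow2 * 2 <= usable)
        pow2 *= 2;                             // -> 8192, nearly half is lost

    printf("usable smem: %d bytes, largest power-of-two buffer: %d bytes\n",
           (int)usable, (int)pow2);
    return 0;
}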

Current CUDA implementations follow a load/store architecture, which means that operations can only be performed between registers. Values from global memory must be loaded into registers first, and results must be stored back to global memory afterwards. So the minimum number of registers for this kernel is 4: one each for i, A[i], B[i], and the sum A[i]+B[i], all of which are live at the same time. This already assumes that temporary registers for the address calculations are reused.
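
Spelled out on the kernel itself, here is one way to picture those four values; the register numbering in the comments is purely illustrative, not what ptxas actually allocates:

__global__ void vectorAddition(float* A, float* B, float* result)
{
    int   i = threadIdx.x;  // register 1: the index (reused for address math)
    float a = A[i];         // register 2: A[i] loaded from global memory
    float b = B[i];         // register 3: B[i] loaded from global memory
    float s = a + b;        // register 4: the sum, live until the store below
    result[i] = s;          // store back to global memory
}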

I assume the kernel compiled for sm_20 uses more registers due to some optimization. However, I am unable to check that, as with the CUDA 3.1 installation on the computer I am currently on, the kernel actually uses only 4 registers even for sm_20.