Kernel arguments are stored in shared memory (smem) on compute capability 1.x devices and in constant memory (cmem) on 2.x devices.
As your kernel does not do anything, it is also easily optimized to use zero registers. Why the compiler is not able to fully optimize the kernel away on sm_20, I don’t know. But it’s probably more interesting to investigate a kernel that actually does something.
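For reference, the trivial kernel under discussion might look like the sketch below (the parameter list is my assumption, not taken from your code). Compiling with `-Xptxas -v` makes ptxas print the per-kernel register, smem, and cmem usage, which is how the numbers above can be checked:

```cuda
// Hypothetical empty kernel similar to the one under discussion.
// Its arguments land in smem when built for sm_1x and in cmem for sm_20.
__global__ void empty_kernel(const float *A, const float *B, float *result)
{
    // Body intentionally empty -- the compiler can optimize everything away.
}

// Inspect resource usage at compile time, e.g.:
//   nvcc -arch=sm_20 -Xptxas -v empty.cu
// ptxas then reports a "Used N registers, ..." line for each kernel.
```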
One con of parameters in smem is that an algorithm sometimes requires a power-of-two number of bytes of smem. Since 1.x devices store the arguments there, plus another 16 bytes per block that you can't (legally) get rid of, you'll have to fall back to the next smaller power of two and waste almost half of the shared memory.
Parameters in cmem also allow a longer parameter list, since cmem is not as scarce as smem.
Current CUDA implementations follow a load/store architecture, which means that operations can only be performed on registers. Values from global memory must first be loaded into registers, and results must be stored back to global memory afterwards. So the minimum number of registers for this kernel is 4, to hold i, A[i], B[i], and result[i], which are all live at the same time. This already assumes that the temporary registers for the address calculations are reused.
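As a sketch of the kind of kernel meant here (the names are assumed), a simple element-wise add compiles to one load per input, an add, and a store, and the four values named above each occupy a register:

```cuda
// Element-wise vector add: illustrates the load/store pattern.
// Assumed names; launch with enough threads to cover the array length.
__global__ void vec_add(const float *A, const float *B, float *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // register for i
    float a = A[i];      // load A[i] from global memory into a register
    float b = B[i];      // load B[i] into another register
    result[i] = a + b;   // sum computed in a register, then stored
}
```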
I assume the kernel compiled for sm_20 uses more registers due to some optimization. However, I am unable to check that, as the CUDA 3.1 installation on the computer I am currently on actually produces code using only 4 registers even for sm_20.