In general, the increase in register use caused by the ABI should be minor (often just one register). Prior to CUDA 4.0 there were still a few rough edges in the ABI implementation which in some cases drove up register usage significantly. As far as I know, all known issues of this nature were fixed by CUDA 4.0 at the latest.
As has been pointed out above, one should expect to see differences in register usage based on host platform and GPU architecture.
The sm_2x architecture made various changes compared to the sm_1x architecture that tend to increase the use of general purpose registers. In particular, it is more of a strict load-store architecture, and it removed the separate address registers. Also, with sm_2x, IEEE-compliant single-precision division and square root became the default implementations. The subroutines that implement this functionality require additional registers over the approximate implementations used on sm_1x GPUs. Note that programmers can opt out of the IEEE-compliant single-precision division and square root by use of the compiler switches -prec-div=false and -prec-sqrt=false.
In terms of host platform, it has always been the case that on 64-bit host platforms both the host and the device side code use 64-bit pointers, and both use 64-bit “long” on host platforms where “long” is a 64-bit type. This provides storage consistency across the host-device boundary, which is useful for passing structs, for example. Consequently, more registers may be required for device code when the build target is a 64-bit platform, compared to the equivalent 32-bit platform. For sm_1x devices, the compiler applies various optimizations based on the fact that no sm_1x GPU provides more than 4GB of storage, keeping the increase in register use between 32-bit and 64-bit platforms to a minimum. Since there are sm_2x devices with more than 4GB of memory, most of these optimizations are no longer applicable on sm_2x devices. On a 64-bit platform, code compiled for sm_2x therefore tends to use more registers than the same code built for sm_1x. The increase in register usage can be significant if many pointers are used in the code.
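To make the struct-passing point concrete, here is a minimal host-only sketch in plain C++. The struct KernelParams and the helpers host_layout and print_host_layout are hypothetical names made up for this illustration; they are not part of CUDA. The sketch only reports the layout the host compiler chooses, which is the layout the device code has to agree with if such a struct is passed across the host-device boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical kernel-argument struct, for illustration only.
struct KernelParams {
    float* data;   // pointer width follows the build target (32-bit or 64-bit)
    long   count;  // 64-bit only on hosts whose ABI defines long as 64-bit
    int    flags;
};

// Sizes and member offsets as seen by the host compiler.
struct LayoutInfo {
    std::size_t pointer_size;
    std::size_t long_size;
    std::size_t count_offset;
    std::size_t flags_offset;
    std::size_t total_size;
};

// Collect the host-side layout of KernelParams.
LayoutInfo host_layout() {
    LayoutInfo info;
    info.pointer_size = sizeof(void*);
    info.long_size    = sizeof(long);
    info.count_offset = offsetof(KernelParams, count);
    info.flags_offset = offsetof(KernelParams, flags);
    info.total_size   = sizeof(KernelParams);
    // Members are laid out in declaration order, so 'count' cannot
    // start before the end of the pointer member.
    assert(info.count_offset >= info.pointer_size);
    return info;
}

// Print the layout; the numbers differ between 32-bit and 64-bit builds.
void print_host_layout() {
    const LayoutInfo info = host_layout();
    std::printf("sizeof(void*)        = %zu\n", info.pointer_size);
    std::printf("sizeof(long)         = %zu\n", info.long_size);
    std::printf("offset of count      = %zu\n", info.count_offset);
    std::printf("offset of flags      = %zu\n", info.flags_offset);
    std::printf("sizeof(KernelParams) = %zu\n", info.total_size);
}
```

Calling print_host_layout() from a 32-bit build and from a 64-bit build of the same source should show the pointer (and, on platforms where “long” is 64-bit, the long) doubling in size. Each such 64-bit value occupies two 32-bit registers in device code, which is where the extra register use comes from.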