I am guessing you are on a 64-bit host platform, because that is typically where the biggest changes in register use are seen when transitioning from sm_1x to sm_2x.
CUDA makes all device-side data types the same size as the corresponding host-side data types. Among other things, this makes compound data types portable across the host-device boundary. In particular, on a 64-bit host platform, pointers and size_t are 64-bit types on both host and device, and on those 64-bit platforms where a long occupies 64 bits, the same applies to long on the device.
Note that there is no hardware support for most 64-bit integer operations on current GPUs, meaning a 64-bit operand occupies two 32-bit registers and most 64-bit operations are implemented as two or more 32-bit operations.
No sm_1x platform accommodated more than 4 GB of memory, meaning only the lower 32 bits of device pointers were meaningful. This allowed the compiler to optimize away many of the operations on the most significant 32 bits of device pointers, which in turn freed up the registers holding those upper 32 bits.
There are, however, sm_2x platforms that provide more than 4 GB of memory, so most of these optimizations no longer apply. This means that for builds on a 64-bit host, code compiled for sm_2x carries around the full 64 bits of a pointer at all times, which increases register use compared to an sm_1x target. Note that these are not only pointers explicitly occurring in the source code, but also pointers created by the compiler as part of common optimizations, for example by strength reduction / induction variable creation during array traversal in a loop.
For code that makes heavy use of shared memory, there is an additional source of increased register usage. On sm_1x devices, shared memory could be accessed via special address registers that existed in addition to the general-purpose registers, but on sm_2x devices general-purpose registers are used for this purpose, as separate address registers no longer exist.
There are additional second-order effects that can increase register pressure on sm_2x devices. The sm_2x instruction set architecture is closer to a strict load/store architecture and therefore frequently requires a couple of additional registers for temporary storage. The introduction of an ABI (needed to support many C++ features, device-side printf and malloc, etc.) means that one extra register is required for a stack pointer.
The compiler has been improved consistently since sm_2x support was first added in CUDA 3.0. At this point you would definitely want to use the CUDA 4.0 toolchain. Please keep in mind that, due to the changes described above, some increase in register usage is unavoidable for most code when transitioning from sm_1x to sm_2x, even as the compiler works hard to keep register usage low. While this is typically not necessary, you may want to look into using either the __launch_bounds__ attribute or the -maxrregcount compiler switch to limit register use per thread below the bound picked by the compiler. Note that this may increase the dynamic instruction count and/or cause register spilling, either of which may in turn decrease performance. The caches on sm_2x devices can absorb minor spilling, but significant spilling will overwhelm them. The compiler flag -Xptxas -v causes the compiler to emit relevant statistics for each kernel.
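As a sketch of the per-kernel mechanism (the kernel and the specific bounds are hypothetical, chosen only to show the syntax):

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells
// ptxas to cap per-thread register use so that blocks of up to 256 threads
// can run with at least 4 blocks resident per multiprocessor.
__global__ void __launch_bounds__(256, 4)
scale(float *out, const float *in, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * factor;
}
```

In contrast to __launch_bounds__, which is set per kernel, -maxrregcount applies a single limit to every kernel in the compilation unit; building with -Xptxas -v then reports the resulting register counts so you can see the effect.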
If your code uses single-precision reciprocals, divisions, or square roots, there is an additional source of register use increase when transitioning from sm_1x to sm_2x. For sm_1x platforms the division operator (with reciprocal as a special sub-case) and the sqrtf() function map to approximate versions, while on sm_2x platforms they map to IEEE-rounded versions by default. The increase in accuracy requires more elaborate implementations that use more instructions and some additional registers. You can approximate the sm_1x behavior by passing the following compiler flags to nvcc: -ftz=true -prec-sqrt=false -prec-div=false
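Putting those flags into a full compile line looks like this (the source and output file names are placeholders for your own build):

```shell
# sm_2x build that approximates sm_1x numerics: flush denormals to zero,
# use the approximate square root and approximate division.
nvcc -arch=sm_20 -ftz=true -prec-sqrt=false -prec-div=false -o app app.cu
```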