Just a tip:
I have been cleaning up a set of kernels to work on pre-Fermi devices. The kernels compile cleanly and run very fast on Fermi, so I was disappointed when I saw 200+ bytes of spills when targeting sm_1x devices.
The workaround was to use the unsupported switch --nvvm to force use of the newer LLVM compiler path.
The LLVM-compiled kernels fit into the available registers and pass all my tests. With the spills gone, I got a very solid performance improvement.
I have some more detail here.