sm_35 provides 255 programmer-usable registers, so you should in general not run into the kind of register pressure issues that may be encountered on sm_2x and sm_30 platforms, which only provide 63 programmer-usable registers (one register encoding is reserved for a dedicated zero register in all these architectures). sm_1x provides 124 programmer-usable registers, R0 through R123.
If you JIT sm_1x code on sm_2x platforms, you can certainly get register spilling in the resulting machine code; you may just not be aware of it. Offhand, I don't know whether the JIT compiler offers the kind of per-kernel resource usage reporting that is activated by -Xptxas -v in offline builds.
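For reference, a minimal sketch of how that per-kernel report is requested in an offline build (kernel.cu is a placeholder file name here):

```shell
# Pass -v through to ptxas to get per-kernel resource usage
# (registers, shared/constant memory, spill stores/loads).
nvcc -arch=sm_35 -Xptxas -v -c kernel.cu
# ptxas then prints info lines of the general form:
#   ptxas info : Used NN registers, ... bytes cmem[0]
```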
Please note that there are some differences between sm_1x compilation and sm_2x / sm_3x compilation that can lead to higher register use with the latter:
(1) Single-precision arithmetic is IEEE-754 compliant by default on sm_2x and sm_3x, meaning denormals are supported and square root, reciprocal, and division are correctly rounded. On sm_1x, FTZ (flush to zero) is used, and square root, reciprocal, and division are approximate. To approximate the sm_1x behavior, compile with -ftz=true -prec-sqrt=false -prec-div=false.
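Putting those flags into a full command line, a compilation for an sm_3x target that approximates sm_1x single-precision behavior would look something like this (kernel.cu is again a placeholder file name):

```shell
# Flush denormals to zero and use the fast approximate square root
# and division, matching sm_1x defaults on an sm_2x/sm_3x target.
nvcc -arch=sm_30 -ftz=true -prec-sqrt=false -prec-div=false -c kernel.cu
```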
(2) CUDA keeps data types consistent between host and device. In particular, the size of pointers matches the host platform, so on a 64-bit platform device pointers are also 64 bit. However, for an sm_1x target, the compiler knows that no devices with > 4 GB memory ever shipped, so it optimizes aggressively knowing that only the lower 32 bits of a pointer are ever needed during memory access. Most of these optimizations are not possible on later architectures, for which GPUs with > 4 GB memory are available.
You should also be aware that different compiler front ends are used: for sm_1x, PTX is generated by the Open64 compiler, while for sm_2x and sm_3x targets an LLVM-derived compiler is used.