Why can I run sm_10 binaries with >64 registers/thread on Fermi/Kepler just fine?


I would like to understand why I can use 64-124 registers per thread when targeting sm_10 through sm_13, and the resulting binary still runs just fine on Fermi and Kepler devices, whose native instruction sets don’t support more than 64 registers.

My binaries are usually generated with embedded PTX code, but my understanding is that this PTX code has to be translated into the card’s native instruction set before execution. Wouldn’t it result in register spilling at that stage?

In contrast, when I target sm_20 through sm_35, I already get register spilling or high stack utilization at the compilation stage, which slows the kernel down, often to the point of it being unusable.

So, to conclude: when developing kernels with high register pressure, I usually compile for sm_1x, and it’s working well for me. I’d just like to understand how that is possible.

It is possible that the JIT compilation isn’t exactly the same as a command line compile. For example allow-expensive-optimizations (a ptxas option) may be false for speed / resource reasons. You could try toggling that switch with the command line compile to see if it has any effect.

Also, you should be able to compile the binaries as you like directly by specifying the gpu-architecture as compute_13 and the gpu-code as sm_30. Then you can examine the ISA for both versions to see what is going on.

sm_35 provides for 255 programmer-usable registers, so you should in general not run into the kind of register pressure issues that may be encountered on sm_2x and sm_30 platforms, which only provide 63 programmer-usable registers (one register encoding is used for a dedicated zero register in all these architectures). sm_1x provides for 124 programmer-usable registers, R0 through R123.

If you JIT sm_1x code on sm_2x platforms, you can certainly get register spilling in the resulting machine code; you may just not be aware of it. Offhand I don’t know whether the JIT compiler offers the kind of per-kernel resource usage reporting that is activated by -Xptxas -v in offline builds.
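One way to check what the JIT-compiled code actually uses is to ask the runtime after the kernel has been loaded. The following is only a minimal sketch: heavyKernel is a hypothetical stand-in for your own kernel, the file name and build line in the comment are assumptions, and building for compute_13 requires a toolkit old enough to still support sm_1x targets. cudaFuncGetAttributes and the cudaFuncAttributes fields used here are part of the CUDA runtime API.

```cuda
// query_attrs.cu (hypothetical file name)
// Assumed PTX-only build so that the driver has to JIT at load time:
//   nvcc -arch=compute_13 -code=compute_13 query_attrs.cu -o query_attrs
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder; substitute the register-hungry kernel of interest.
__global__ void heavyKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = 2.0f * i;
    }
}

// Fills *attr with the resource usage of the machine code that was loaded
// for heavyKernel on the current device. Returns 0 on success, 1 on failure.
int queryHeavyKernel(cudaFuncAttributes *attr)
{
    cudaError_t err = cudaFuncGetAttributes(attr, heavyKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main()
{
    cudaFuncAttributes attr;
    if (queryHeavyKernel(&attr) != 0) {
        return 1;
    }
    printf("registers per thread    : %d\n", attr.numRegs);
    // Local memory covers register spills as well as stack frames and local arrays.
    printf("local memory per thread : %lu bytes\n", (unsigned long)attr.localSizeBytes);
    printf("PTX version             : %d\n", attr.ptxVersion);
    printf("binary version          : %d\n", attr.binaryVersion);
    return 0;
}
```

A non-zero local memory size together with a register count at the architecture limit hints at spilling, although local arrays and stack frames also count toward that number, so it is an indication rather than proof.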

Please note that there are some differences between sm_1x compilation and sm_2x / sm_3x compilation that can lead to higher register use with the latter:

(1) Single-precision arithmetic by default is IEEE-754 compliant on sm_2x and sm_3x, meaning denormals are supported and square root, reciprocal, and division are correctly rounded. On sm_1x, FTZ (flush to zero) is used, and square root, reciprocal, and division are approximate. To approximate the sm_1x behavior, compile with -ftz=true -prec-sqrt=false -prec-div=false

(2) CUDA keeps data types consistent between host and device. In particular, the size of pointers matches the host platform, so on a 64-bit platform device pointers are also 64 bits. However, for an sm_1x target, the compiler knows that no devices with > 4 GB memory ever shipped, so it optimizes aggressively, knowing that only the lower 32 bits of a pointer are ever used during memory access. Most of these optimizations are not possible on later architectures, for which GPUs with > 4 GB memory are available.
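To see the pointer-size point from (2) on your own platform, a small check along these lines can help. This is a sketch only; pointerSizeKernel and devicePointerSize are hypothetical names introduced for illustration, not anything from the toolkit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Writes the size of a pointer, as seen by device code, to *out.
__global__ void pointerSizeKernel(unsigned int *out)
{
    *out = (unsigned int)sizeof(void *);
}

// Returns the device-side pointer size in bytes, or 0 if any CUDA call fails.
unsigned int devicePointerSize()
{
    unsigned int *d_out = 0;
    unsigned int h_out = 0;
    if (cudaMalloc((void **)&d_out, sizeof(unsigned int)) != cudaSuccess) {
        return 0;
    }
    pointerSizeKernel<<<1, 1>>>(d_out);
    // cudaMemcpy waits for the kernel to finish before copying the result back.
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return (err == cudaSuccess) ? h_out : 0;
}

int main()
{
    printf("host pointer size   : %u bytes\n", (unsigned int)sizeof(void *));
    printf("device pointer size : %u bytes\n", devicePointerSize());
    return 0;
}
```

On a 64-bit build both lines should report the same size, which is why every address computation in a pointer-heavy kernel can occupy two 32-bit registers on architectures where the compiler cannot assume a 32-bit address space.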

You should also be aware that different compiler front-ends are used: For sm_1x PTX is generated by the Open64 compiler, while for sm_2x and sm_3x targets an LLVM-derived compiler is used.