Register count

I wrote some code in PTX. It declares:
69 64-bit registers
2 32-bit registers
6 predicates
I think this code should use 140 32-bit registers + 6 predicates on compute capability 3.5.
I checked with nvprof…
The code uses all 255 registers plus local memory.

Your question as to “why” is impossible to answer without seeing the entire code and analyzing the generated machine code. You can look at the machine code by disassembling it with cuobjdump --dump-sass.

Keep in mind that PTXAS (the part of the compiler that translates PTX to machine language) is a compiler, not an assembler. So your code may be transformed more extensively than you expect. For example, PTXAS can unroll loops or extract common subexpressions that are then assigned to temporary variables. Loads may be scheduled early to increase memory latency tolerance, but this also lengthens register live ranges and thus increases overall register usage.
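Because the final register allocation is decided by PTXAS and not by the PTX declarations, it can help to compare the naive estimate with what the compiled kernel actually uses. The following is a minimal sketch, not a definitive tool: `exampleKernel` and `estimate32BitRegs` are hypothetical names introduced for illustration, and the kernel is only a stand-in for the real code. It uses the CUDA runtime call `cudaFuncGetAttributes` to read the register count and per-thread local memory size chosen by the compiler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace it with the kernel under investigation.
__global__ void exampleKernel(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * 2.0 + 1.0;
    }
}

// Naive estimate from the PTX declarations: each 64-bit virtual register
// needs two 32-bit hardware registers. Predicates are not counted here.
static int estimate32BitRegs(int num64BitRegs, int num32BitRegs)
{
    return 2 * num64BitRegs + num32BitRegs;
}

int main()
{
    // Numbers taken from the question: 69 64-bit and 2 32-bit PTX registers.
    printf("PTX-level estimate: %d 32-bit registers\n",
           estimate32BitRegs(69, 2));

    // Ask the runtime what the compiled kernel actually requires.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, exampleKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```

A nonzero local memory size usually indicates register spilling or thread-local arrays. The same figures are printed at compile time when `-Xptxas -v` is passed to nvcc.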

Your PTX code may contain operations for which there is no hardware support and that need to be emulated, requiring both additional instructions and additional registers. Examples would be double-precision division and square root.
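To see the effect of emulated operations, one can compile a tiny kernel that does nothing but a double-precision division and square root and then inspect its machine code. This is only an illustrative sketch: `divSqrtKernel` and the file names are hypothetical, the `sm_35` target is an assumption based on the "3.5" mentioned in the question, and the exact instruction sequence and register count depend on the architecture and toolkit version.

```cuda
// Hypothetical kernel: one double-precision division and one square root
// per element. Neither maps to a single hardware instruction, so the
// compiler expands each into a multi-instruction sequence that needs
// extra temporary registers.
__global__ void divSqrtKernel(const double *a, const double *b,
                              double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = sqrt(a[i] / b[i]);
    }
}

// Build and inspect (file names and target architecture are assumptions):
//   nvcc -arch=sm_35 -Xptxas -v -cubin -o divsqrt.cubin divsqrt.cu
//   cuobjdump --dump-sass divsqrt.cubin
```

Comparing the few lines of PTX generated for this kernel with the much longer SASS listing gives a feel for how far the two representations can diverge.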