It is entirely possible for the same HLL source code to result in machine code with very different register usage, depending on the machine architecture. CUDA source code is initially translated into an intermediate, platform-independent code called PTX. This code is then further compiled to machine code by a tool chain component called, somewhat misleadingly, ptxas. Despite the name, ptxas is an optimizing compiler rather than a simple assembler.
The compilation of PTX to machine code uses architecture-specific code transformations, in particular for register allocation, instruction selection, and instruction scheduling. For GPU architectures with a larger register file, the relevant heuristics will enable code transformations that generally improve performance at the cost of using additional registers.
Overall, that is a good thing, as the available hardware resources are fully utilized to maximize performance. Occasionally, though, this can backfire, leading to an explosion in register use. This could constitute a bug, or could simply be a limitation of the chosen heuristic. It is impossible to tell which it is here without knowledge of the code. Heuristics generally have the property that they deliver good or acceptable results for the large majority of use cases, but sub-optimal results for a small number of use cases.
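To see how much the register count of a given kernel actually differs between architectures, one option is to build the same source for each target and compare what ptxas reports. The following is only a minimal sketch: the `saxpy` kernel is a hypothetical stand-in for the real code, and `sm_XX` / `sm_YY` in the comments are placeholders for whichever two architectures are being compared. It relies on the `-Xptxas -v` compiler option and the `cudaFuncGetAttributes()` runtime call.

```cuda
// Build once per target architecture and compare the reported numbers.
// sm_XX and sm_YY are placeholders, substitute the architectures of interest:
//   nvcc -arch=sm_XX -Xptxas -v -o regs_xx regs.cu
//   nvcc -arch=sm_YY -Xptxas -v -o regs_yy regs.cu
// With -Xptxas -v, ptxas prints the registers used per kernel at compile time.
// To inspect the intermediate PTX itself:
//   nvcc -ptx regs.cu
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, standing in for the code under investigation.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

int main()
{
    // Query the attributes of the machine code that was built for (or
    // JIT-compiled for) the GPU currently in use.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, saxpy);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread:            %d\n", attr.numRegs);
    printf("local memory per thread (bytes): %zu\n", attr.localSizeBytes);
    return 0;
}
```

Since the PTX is the same for all targets, a large difference in the reported register counts would point to the architecture-specific stage (ptxas) rather than to the front end. A non-zero local memory size can be a hint that registers are being spilled.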