`maxrregcount` silently ignored by `nvcc` and `ptxas`

If you look at the occupancy calculator, you will see that the number of registers per thread can only be set to discrete values, or at least that occupancy only changes at discrete values.

Like most architectures, Ampere has 65536 registers per SM, and specifically a maximum of 1536 threads/SM. That is 42,… registers per thread at full occupancy. It makes no sense to request fewer, or for the GPU to be able to assign fewer. The next practical limit (divisible by 8? divisible by 4?) is 40.
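To make the arithmetic concrete, here is a minimal host-side sketch in plain C++ (no CUDA needed). The helper name `max_useful_regs` and the `granularity` parameter are my own, purely for illustration. I am not sure which allocation granularity applies, so both 4 and 8 are checked.

```cpp
// Largest per-thread register count that still allows `max_threads_per_sm`
// resident threads, rounded down to a multiple of `granularity`.
// ASSUMPTION: `granularity` (4 or 8) is a guess; the real allocation unit
// is not confirmed here. The helper name is hypothetical.
constexpr int max_useful_regs(int regs_per_sm, int max_threads_per_sm,
                              int granularity) {
    // Integer part of the quotient, e.g. 65536 / 1536 -> 42.
    const int whole = regs_per_sm / max_threads_per_sm;
    // Round down to the nearest multiple of the assumed granularity.
    return whole - whole % granularity;
}

// Ampere figures from the post: 65536 registers/SM, 1536 threads/SM.
static_assert(max_useful_regs(65536, 1536, 8) == 40, "granularity 8");
static_assert(max_useful_regs(65536, 1536, 4) == 40, "granularity 4");

// Turing figures from the post: 65536 registers/SM, 1024 threads/SM.
static_assert(max_useful_regs(65536, 1024, 8) == 64, "Turing");
```

Both candidate granularities give 40 for the Ampere numbers, so the open question about 4 vs. 8 does not change the conclusion here.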

So I would recommend finding a way to create an example for the 44-register case.

You can also try `__launch_bounds__` to hint that `add_kernel_10` should be able to run with 1536 threads resident per SM. With 44 registers/thread it would then be over the limit.
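As a rough check of that claim, here is a host-side C++ sketch. A block is limited to 1024 threads, so 1536 resident threads have to be expressed as several blocks per SM. The pair (768, 2) is only one possible choice, and the kernel signature in the comment is hypothetical. The helper names are mine, and the check deliberately ignores per-warp allocation granularity.

```cpp
// Figures from the post; helper names below are hypothetical.
constexpr int kRegsPerSM = 65536;
constexpr int kMaxThreadsPerBlock = 1024;  // CUDA per-block thread limit

// In CUDA source the hint could look like this (signature is an assumption):
//   __global__ void __launch_bounds__(768, 2) add_kernel_10(/* ... */);
// i.e. at most 768 threads per block, at least 2 blocks resident per SM.
constexpr int resident_threads(int max_threads_per_block,
                               int min_blocks_per_sm) {
    return max_threads_per_block * min_blocks_per_sm;
}

// Simplified check: does the register file hold this many threads?
// ASSUMPTION: ignores allocation granularity and any reserved registers.
constexpr bool fits_register_file(int regs_per_thread, int threads_per_sm) {
    return regs_per_thread * threads_per_sm <= kRegsPerSM;
}

static_assert(768 <= kMaxThreadsPerBlock, "block size must be legal");
static_assert(resident_threads(768, 2) == 1536, "768 * 2 = 1536");
static_assert(!fits_register_file(44, 1536), "44 * 1536 = 67584 > 65536");
static_assert(fits_register_file(40, 1536), "40 * 1536 = 61440 <= 65536");
```

With such a bound in place the compiler would have to get below 44 registers per thread to honour it, which should make the effect of the limit visible in the `ptxas` output.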

@njuffa: 65536 / 1024 = 64; so Turing could not use a register allocation less than 64 for anything useful, as the maximum number of threads per SM is 1024. See also this post of yours I just found ;-): questions about maxrregcount and Xptxas

For your tests, please try turning off the ABI with `-abi=no`.