Compute Capability 3.5 cards have a maximum of 65536 registers per block and 255 registers per thread, where (AFAIK) the 256th register is used to store the location in global memory where registers are spilled (the "overflow" register). If I use 512 threads per block, I can use a maximum of 65536 (registers/block) / 512 (threads/block) = 128 registers per thread, which means I need to use

    --maxrregcount=n

when compiling. A value of 129 or more for n results in a launch error due to unavailable resources (as it should), and a value of 128 or less works, but I'm not sure why 128 is OK. Should the value of n be 128 or 127? If it should (or can) be 128, where is the "overflow" register?
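To make the arithmetic above concrete, here is a minimal host-side sketch of the budget calculation. The helper names `max_regs_per_thread` and `launch_fits` are hypothetical (they are not part of the CUDA toolkit), the 255 cap is the per-thread figure quoted above, and the sketch assumes a plain integer division, ignoring any register allocation granularity the hardware may apply.

```cpp
#include <algorithm>
#include <cassert>

// Per-thread register cap on Compute Capability 3.5, as quoted in the post.
constexpr int kMaxRegsPerThread = 255;

// Hypothetical helper: largest per-thread register count that still fits
// the per-block register budget for a given block size.
// Assumption: simple floor division, no allocation-granularity rounding.
int max_regs_per_thread(int regs_per_block, int threads_per_block) {
    assert(regs_per_block > 0 && threads_per_block > 0);
    int budget = regs_per_block / threads_per_block;
    return std::min(budget, kMaxRegsPerThread);
}

// Hypothetical helper: true if a kernel compiled to use `regs_per_thread`
// registers fits the per-block budget at this block size.
bool launch_fits(int regs_per_thread, int regs_per_block, int threads_per_block) {
    return regs_per_thread <= max_regs_per_thread(regs_per_block, threads_per_block);
}
```

With 65536 registers per block and 512 threads per block, `max_regs_per_thread` gives 128, so `launch_fits(128, 65536, 512)` holds and `launch_fits(129, 65536, 512)` does not, which matches the behaviour I see. On a real device the per-block figure could presumably be read from `cudaDeviceProp::regsPerBlock` rather than hard-coded.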