CUDA FORTRAN/OpenACC "Overflow" Register with maxr

Hi All,

Compute Capability 3.5 cards have a maximum of 65536 registers per block and 255 registers per thread, where (AFAIK) the 256th register is used to store the location in global memory to which registers are spilled (the “overflow” register). If I use 512 threads per block, I can use at most 65536 (registers/block) / 512 (threads/block) = 128 registers per thread, which means I need to use

maxregcount:n

when compiling. A value of 129 or more for n results in a launch error due to unavailable resources (as it should) and a value of 128 or less works, but I’m not sure why 128 is ok. Should the value of n be 128 or 127? If it should/can be 128, where is the “overflow” register?
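The per-thread budget worked out above can be sketched in a few lines of Python. This is only the arithmetic from this post, not anything taken from the compiler or the CUDA toolkit: the function name `regs_per_thread_budget` is made up for illustration, and the sketch assumes registers are divided evenly among threads, ignoring any allocation granularity the hardware may apply.

```python
# Limits quoted in this post for Compute Capability 3.5.
REGS_PER_BLOCK = 65536
MAX_REGS_PER_THREAD = 255


def regs_per_thread_budget(threads_per_block):
    """Largest per-thread register count that fits both limits.

    Hypothetical helper: assumes an even split of the block's registers
    across its threads (no allocation-granularity rounding).
    """
    even_split = REGS_PER_BLOCK // threads_per_block
    return min(even_split, MAX_REGS_PER_THREAD)


for threads in (256, 512, 1024):
    print(threads, regs_per_thread_budget(threads))
# 256 255
# 512 128
# 1024 64
```

With 512 threads per block the block limit is the binding one (128), while with 256 threads per block the per-thread limit (255) takes over.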

Cheers,
Kyle

Hi Kyle,

The reason this works is that the “overflow” register isn’t the 256th register; rather, it’s a special register encoding in the instruction (all 1’s), identified as “RZ”.

- Mat

Hi Mat,

Thanks for the info. I’ve seen RZ in PTX before but couldn’t figure out what it was. I thought it might be something to do with round-to-zero, but it didn’t make sense for it to be that in the way it was used in PTX.

Knowing this, why is it possible to use 512(threads/block)*128(registers/thread)=65536(registers/block), but only 256(threads/block)*255(registers/thread)=65280?

Is there anywhere I can find the answers to questions like this?

Cheers,
Kyle

Hi Kyle,

You’re welcome to ask these questions here, and I can ask around if I don’t know the answer. That said, stackoverflow.com is also a good place to ask these types of low-level CUDA questions.

While I don’t know for a fact, I would think the reason for the per-thread limit is that register numbers are represented by an 8-bit value (00-FF). Given that RZ is represented as FF (all 1’s), only registers R0 through R254 can be identified. Hence, in the 256-thread case you’re hitting the limit of 255 registers per thread. In the other case, you’re hitting the limit of 65536 registers per block. These are two separate but related limits.
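A small Python sketch of the reasoning above, under the same unconfirmed assumption that a register operand field spans 00-FF and that the all-ones value is reserved for RZ. The constant and function names are hypothetical and only illustrate the counting; they do not come from any CUDA tool.

```python
# Assumption from the post: register numbers occupy a field covering 00-FF,
# and the all-ones encoding (FF) is reserved for RZ.
REG_FIELD_BITS = 8
NUM_ENCODINGS = 1 << REG_FIELD_BITS   # 256 possible encodings
RZ_ENCODING = NUM_ENCODINGS - 1       # 0xFF, all ones

# Every encoding except RZ can name a general-purpose register.
general_regs = [n for n in range(NUM_ENCODINGS) if n != RZ_ENCODING]

REGS_PER_BLOCK = 65536                # per-block limit quoted in the thread


def registers_used(threads_per_block, regs_per_thread):
    """Total registers a block would need (simple product, no rounding)."""
    return threads_per_block * regs_per_thread


print(len(general_regs), general_regs[0], general_regs[-1])  # 255 0 254
print(registers_used(256, len(general_regs)))                # 65280
print(registers_used(512, 128))                              # 65536
```

With 256 threads the per-thread cap of 255 stops the total at 65280, short of the 65536 block limit; with 512 threads the block limit is reached first, at 128 registers per thread.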

If you want a more definitive answer, I can investigate.

- Mat

Hi Mat,

“If you want a more definitive answer, I can investigate.”

No need. Couldn’t have asked for a better answer.

Cheers,
Kyle

Hi Mat/All,

I didn’t check my results properly when I used 512 (threads/block) * 128 (registers/thread) = 65536 (registers/block). That configuration runs but gives an incorrect answer. I had to use 512 (threads/block) * 127 (registers/thread) = 65024 (registers/block) to get the correct answer. Sorry for the error.

Regards,
Kyle