How is the number of required registers per thread counted?

kwang · November 20, 2009, 11:23pm

Hi, there,

In my kernel, I declared only 11 variables (int/float), so I assumed each thread would use only 11 registers. I do use a lot of #define macros, though. When I compile with nvcc and the flag --ptxas-options=-v, it reports ‘Used 31’ registers. That limits the number of threads per block I can use. Could anybody explain a little about how CUDA counts the registers I use?

Thanks very much.

seibert · November 20, 2009, 11:33pm

A variable at the C level does not correspond one-to-one to a register, because the compiler frequently needs to store intermediate values of a calculation somewhere. (A single C statement can compile down to many PTX instructions.) This can push the register count up, especially if your code has complex expressions. On the other hand, the assembler is free to reuse a register for multiple variables when possible, which can bring the register usage back down. In general, there is only a weak correlation between the number of variables at the C level and the number of registers required on the device.
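To make the point about intermediate values concrete, here is a minimal sketch. The kernel `few_variables` and the helper `registers_per_thread` are hypothetical names made up for illustration; only `cudaFuncGetAttributes` and the `numRegs` field come from the CUDA runtime API. The kernel names just three locals, yet the long expression needs temporaries, so the reported count will generally differ from three. The actual number depends on the compiler version and target architecture, so none is quoted here.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: only three named locals (i, x, y), but the long
// expression on the right-hand side of y needs temporaries for its
// intermediate results, and those temporaries live in registers too.
__global__ void few_variables(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        float y = sinf(x) * cosf(x) + sqrtf(x * x + 1.0f) / (expf(-x) + 2.0f);
        out[i] = y * y + x;
    }
}

// Returns the per-thread register count assigned to the kernel above,
// or -1 if the query fails (for example, when no CUDA device is present).
int registers_per_thread()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, few_variables) != cudaSuccess)
        return -1;
    return attr.numRegs;
}
```

Calling `registers_per_thread()` from the host and printing the result should give the same figure that `--ptxas-options=-v` reports at compile time for the architecture actually running the kernel.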

You can force the compiler to use fewer registers with the --maxrregcount option to nvcc. This can make the compiler spill intermediate results into local memory (which, confusingly, resides in the off-chip global memory area), and that can slow things down. Experiment to see whether it helps in your case.
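As a rough sketch of that experiment: the file name `kernel.cu`, the cap of 16, the kernel `scale_add`, and the helper `local_bytes_per_thread` are all placeholders chosen for illustration. The `__launch_bounds__` qualifier is not mentioned above; it is a per-kernel alternative to the file-wide flag, shown here on the assumption that your toolkit supports it.

```cuda
#include <cuda_runtime.h>

// Build examples (kernel.cu and the cap of 16 are placeholders):
//   nvcc --ptxas-options=-v -c kernel.cu                    // report register usage
//   nvcc --ptxas-options=-v --maxrregcount=16 -c kernel.cu  // cap every kernel in the file

// Per-kernel alternative: promise the compiler that this kernel is never
// launched with more than 256 threads per block, so it can budget
// registers for that block size instead of using a file-wide cap.
__global__ void __launch_bounds__(256)
scale_add(const float *in, float *out, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i] + 1.0f;
}

// Returns the bytes of local memory used per thread by scale_add, or -1 if
// the query fails. A nonzero value suggests that values were placed in
// local memory instead of registers.
long local_bytes_per_thread()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, scale_add) != cudaSuccess)
        return -1;
    return static_cast<long>(attr.localSizeBytes);
}
```

Comparing the register count and local memory size with and without the cap, together with a timing of the kernel, is one way to judge whether the lower register count is worth the extra memory traffic.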

kwang · November 20, 2009, 11:46pm

Thank you so much. You explained a lot to me. But it makes me feel more desperate about CUDA coding, since I cannot control register usage at all…