Hi, I know that if I use too many registers, some data will be spilled to local memory, which harms performance. But how many is too many? Recently I've been running a test. There are only 64 threads in a block. I use a Tesla card with compute capability 1.3, which should have 16k registers. That means each thread could use 256 registers. But in fact I can never use more than 128 registers per thread. When the number of registers goes beyond something like 124, the rest of the data is put into local memory. I searched but couldn't find the threshold after which data is spilled to local memory. Does anybody have any clues? Thanks!
I thought it’s documented somewhere, but couldn’t immediately find it either. If I remember correctly, the maximum number of registers is 127 for compute capability 1.x and 64 (or 63?) for 2.0.
The reason probably is that there is only a fixed number of bits available in the binary instruction format, although it’s not documented. You might dig into the decuda sources to find out whether that is true.
So this means if I use 64 threads in one block, I can use at most 127×64 = 8128 registers, which means the other 8k registers are wasted?
Yes.
OK. Thanks…
You have 8k registers (32 bits each) with compute capability 1.0 and 1.1 and 16k with compute capability 1.2 and 1.3 (I don't recall the number on Fermi). Theoretically you are limited only by that number divided by the number of active threads.
Practically there is a compiler switch (--maxrregcount) that limits that maximum, so you can trade register usage against occupancy. I believe the default maximum is 32 registers, but you can change that. I've had kernels with 128 registers, so it may not be efficient, but it works.
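As an illustration of that switch (not from the thread above), here is a minimal sketch. The kernel is made up for the example; the point is the ptxas verbose output, which reports per-thread register usage and any local-memory (lmem) spills, so you can see the effect of capping registers with --maxrregcount:

```cuda
// toy.cu -- hypothetical kernel with deliberately high register pressure.
//
// Compile and inspect, e.g.:
//   nvcc -arch=sm_13 --ptxas-options=-v toy.cu
//   nvcc -arch=sm_13 --maxrregcount=32 --ptxas-options=-v toy.cu
//
// ptxas prints a line like "Used N registers, M+0 bytes lmem";
// a nonzero lmem figure means data was spilled to local memory.
__global__ void toy(float *out, const float *in)
{
    float acc[32];               // large per-thread array -> register pressure

    #pragma unroll               // full unroll keeps the array in registers
    for (int i = 0; i < 32; ++i)
        acc[i] = in[threadIdx.x + i];

    float s = 0.0f;
    #pragma unroll
    for (int i = 0; i < 32; ++i)
        s += acc[i] * acc[i];

    out[threadIdx.x] = s;
}
```

With --maxrregcount=32, part of acc will typically be forced into local memory, which shows up as a larger lmem figure in the same ptxas report.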
That isn’t correct. There is a hard limit on registers per thread defined in the PTX specification - 127 on pre 2.0 devices and 63 on Fermi IIRC.
Yes, I know I can use --maxrregcount, but that limits the number of registers you can use, and its maximum is 128. What I'm trying to do is assign each thread more than 128 registers. I think --maxrregcount is for occupancy purposes: it reduces register usage so that more warps can be active, regardless of shared memory usage.
Can you specify which document? I have tried but been unable to find a document that discusses this limit on the number of registers per thread. But according to what I have tested, it seems to be true.
The nvcc 3.0 documentation says 128 registers per thread (pp. 16-17). I have it in my head that the limit is lower for Fermi, but I can't remember where that is documented.
Thanks so much. I thought this 128 was the maximum only for the --maxrregcount option. So 128 registers per thread is also a hardware limit.