Hi, I know that if I use too many registers, some data will be spilled to local memory, which harms performance. But how many is too many? Recently I've been running a test. There are only 64 threads in a block. I use a Tesla card with compute capability 1.3, which should have 16k registers. That means each thread could use 256 registers. But in fact I can never use more than 128 registers per thread. When the number of registers goes beyond something like 124, the rest of the data is put into local memory. I searched but couldn't find the threshold after which data is spilled to local memory. Does anybody have any clues? Thanks!
I thought it’s documented somewhere, but couldn’t immediately find it either. If I remember correctly, the maximum number of registers is 127 for compute capability 1.x and 64 (or 63?) for 2.0.
The reason probably is that there is only a fixed number of bits available in the binary instruction format, although it’s not documented. You might dig into the decuda sources to find out whether that is true.
So this means if I use 64 threads in one block, I can use at most 127×64 = 8128 registers, which means the other 8k registers are wasted?
Yes.
OK. Thanks…
You have 8k registers (32 bits each) with compute capability 1.0 and 1.1 and 16k with compute capability 1.2 and 1.3 (I don't recall the number on Fermi). Theoretically you are limited only by that number divided by the number of active threads.
Practically there is a compiler switch (--maxrregcount) that limits that maximum, so you can trade register usage against occupancy. I believe the default maximum is 32 registers, but you can change that. I've had kernels with 128 registers, so it may not be efficient, but it works.
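As an illustration of that switch (not from the thread above), here is a minimal sketch. The kernel is made up for the example; the point is the ptxas verbose output, which reports per-thread register usage and any local-memory (lmem) spills, so you can see the effect of capping registers with --maxrregcount:

```cuda
// toy.cu -- hypothetical kernel with deliberately high register pressure.
//
// Compile and inspect, e.g.:
//   nvcc -arch=sm_13 --ptxas-options=-v toy.cu
//   nvcc -arch=sm_13 --maxrregcount=32 --ptxas-options=-v toy.cu
//
// ptxas prints a line like "Used N registers, M+0 bytes lmem";
// a nonzero lmem figure means data was spilled to local memory.
__global__ void toy(float *out, const float *in)
{
    float acc[32];               // large per-thread array -> register pressure

    #pragma unroll               // full unroll keeps the array in registers
    for (int i = 0; i < 32; ++i)
        acc[i] = in[threadIdx.x + i];

    float s = 0.0f;
    #pragma unroll
    for (int i = 0; i < 32; ++i)
        s += acc[i] * acc[i];

    out[threadIdx.x] = s;
}
```

With --maxrregcount=32, part of acc will typically be forced into local memory, which shows up as a larger lmem figure in the same ptxas report.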
That isn’t correct. There is a hard limit on registers per thread defined in the PTX specification - 127 on pre 2.0 devices and 63 on Fermi IIRC.
Yes, I know I can use --maxrregcount, but that limits the number of registers you can use, and its maximum is 128. What I'm trying to do is assign each thread more than 128 registers. I think --maxrregcount is for occupancy purposes: it reduces register usage so that more warps can be active, regardless of shared memory usage.
Can you specify which document? I have tried but been unable to find a document that discusses this limit on the number of registers per thread. But according to what I have tested, it seems to be true.
The nvcc 3.0 documentation says 128 registers per thread (pp. 16-17). I have it in my head that the limit is lower for Fermi, but I can't remember where that is documented.
Thanks so much. I thought this 128 was the maximum only for the --maxrregcount option. So 128 registers per thread is also a hardware limit.