I read the CUDA_Occupancy_calculator.xls, and it does not list “124 registers per thread” as a “Physical Limit for GPU”.
In fact, I am really unwilling to put variables in local memory, as that will greatly slow down the kernel. But the compiler only puts 124 of my variables in registers and spills all the others to local memory.
Thanks for clarification Eyal…
Another simple and efficient way to decrease register usage, apart from smem usage, is to declare volatile variables and use float values instead of double ones (that is: don’t forget to put “f” after numeric literals). Use math functions like __sinf() instead of sin() if you can accept a small loss of precision.
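A minimal sketch of those tricks together (hypothetical kernel, names invented for illustration):

```cuda
// Hypothetical kernel illustrating the register-saving tricks above.
__global__ void scale_and_rotate(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // The "f" suffix keeps the literal in single precision; writing
    // 0.5 without it would promote the expression to double.
    // volatile pins the value in one register instead of letting the
    // compiler reload it at every use.
    volatile float half = 0.5f;

    // __sinf() is the fast hardware intrinsic: fewer instructions and
    // registers than sin(), at the cost of some precision.
    out[i] = half * __sinf(in[i]);
}
```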
Cheers,
Thanks for the explanation. But my situation is really an extreme case, because this is a BIG algorithm that needs lots of frequently read and updated state.
Everything is “unsigned int” which is 32-bit.
All variables are defined with “unsigned int v00, v01, v02, …” style, and no “unsigned int v[16]” style is used. So no local memory is explicitly declared.
I understand that lower occupancy can slow down the kernel and that local memory is slow. But my algorithm naturally needs lots of variables. It turned out that when I use “too many” variables, nvcc puts 124 of them in registers and all the others in local memory.
Initially I wanted to use 256 registers per thread. With “Compute Capability 1.3”, there are 16384 registers per multiprocessor. If I use 256 registers, 16384/256 = 64 threads can still run concurrently on each multiprocessor.
With “Compute Capability 1.3”, each multiprocessor has 16384 registers and 16 KB of shared memory. If I use 128 registers per thread, there are only 32 32-bit words of shared memory left for each thread. That is still not enough for my algorithm.
Our code also uses a complex algorithm; I’ve used --maxrregcount to limit the register usage and it still runs fairly OK.
smem might indeed save you only a few registers, based on what you describe.
Reducing the thread count per block to 64 might help a bit; you can even experiment with the number of threads and see which gives you the best performance.
Obviously, if you could post the code we might be able to suggest more…
volatile is good for any intermediate result that you want assigned to a register immediately. It is also useful for constant values that appear several times in the following computations.
CUDA often wastes registers by computing the same thing multiple times. Say you use the following array index several times in some code (for example several times inside a tight loop):
[i*5+y]
CUDA often inlines this computation into the PTX assembly and computes i*5+y multiple times into different target registers. That can be a waste.
volatile int index = i*5+y;
With the above code you force CUDA to compute the index and store it in a register before you enter your computation loop. Then you use [index] inside the loop. That of course implies that i and y have to be constant within the loop ;)
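In kernel form, the trick looks roughly like this (loop bound and names are invented for illustration):

```cuda
__global__ void accumulate_rows(float *out, const float *a, int y)
{
    int i = threadIdx.x;

    // Computed once and pinned in a register; without volatile the
    // compiler may rematerialize i*5+y at each use site, burning
    // extra registers inside the loop.
    volatile int index = i * 5 + y;

    float sum = 0.0f;
    for (int k = 0; k < 100; ++k)   // i and y do not change in here
        sum += a[index];            // reuse the precomputed index
    out[i] = sum;
}
```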
Here is another good one. Constants can also be put into a volatile variable, because otherwise CUDA likes to load the same constant over and over into new registers, even when it is the very same constant.
Say you have some code like
foo = 1.0f + sin(x); bar = 1.0f - cos(x);
Instead, use this:
volatile float one = 1.0f;
foo = one + sin(x); bar = one - cos(x);
The above saves you one register inside the PTX, which often translates to one saved register in the .cubin as well.
In some cases the tricks outlined above will push you over the threshold to better occupancy on the GPU, especially if you are only a few registers short.
Well, use shared memory. Bank conflicts are not necessarily a severe problem. Or try to rethink your algorithm; maybe it is possible to split the computation across several kernel calls, for example. 500 registers/thread is ridiculous: consider that an SM needs around 200 threads to hide latencies, and then imagine the huge amount of register memory the GPU would need to have :)
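One way to split the computation across kernel calls, as suggested above, is to stage intermediate per-thread state in a global scratch buffer between launches. A sketch (kernel and buffer names invented; the real split depends on the algorithm):

```cuda
// Phase 1 writes its partial state to a global scratch buffer;
// phase 2 picks it up, so neither kernel needs all the registers
// of the combined algorithm at once.
__global__ void phase1(float *state, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] = in[i] * in[i];   // first half of the work
}

__global__ void phase2(float *out, const float *state, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = state[i] + 1.0f;   // second half of the work
}

// Host side:
//   phase1<<<blocks, threads>>>(d_state, d_in, n);
//   phase2<<<blocks, threads>>>(d_out, d_state, n);
```

The extra global-memory round trip costs bandwidth, but each kernel's register footprint drops, which can win back occupancy.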
Fermi will have a ratio of registers/processor like G80.
In the Tesla instruction set, register IDs are stored using 7 bits each in the instruction word. Register 124 always contains 0, and registers 125 to 127 probably contain 0 as well or some other constants.
So you can use only the architectural registers R0 to R123.
Even if you could run threads with more registers than that, there wouldn’t be much point in doing so. 64 threads per SM is not enough to saturate the arithmetic units, let alone the memory subsystem…
Fewer registers per core, but more per thread: on Fermi you have 32k registers for 1.5k threads, ~21 per thread at 100% occupancy; on G80 it was ~10 per thread (8k registers for 768 threads); and on GT200 it was 16 per thread (16k registers for 1024 threads).
And for 124 registers per thread: if you already use so much local memory, maybe using a little more will not slow you down if it increases occupancy? You would use more bandwidth, so it could actually be a speedup. With 64 registers per thread you could have 8 warps per MP; that is still very low for hiding memory latency, but it would be far better than the 4 warps in your case, which can’t even hide instruction latency.