Is it possible to use more than 124 registers in a kernel?

My kernel needs lots of registers. When I compile with “--maxrregcount 256”, the output of nvcc is:

ptxas info    : Used 124 registers, 440+0 bytes lmem, 28+24 bytes smem, 756 bytes cmem[1]

So is 124 the hard limit of registers that can be used in a single kernel? Is it possible to use more?

The number of registers is a function of your kernel.

Have a look at CUDA_Occupancy_calculator for more info.
Cheers,

luca

I read the CUDA_Occupancy_calculator.xls, and it does not say “124 registers per thread” is a “Physical Limits for GPU”.

In fact, I am really unwilling to put the registers in local memory, as it will greatly slow down the kernel. But the compiler only puts 124 of my variables in registers and all the others in local memory.

Any other comments?

To clarify the previous posters:

The ptxas line means that your kernel actually uses 124 registers, which is indeed a lot. You should also use the Occupancy calculator (not just read it, but actually put in the ptxas output and see what your occupancy is).

In any case, as Sarnath said, the number of registers, lmem and smem is defined by your code. So basically you have a kernel with a lot of registers (less occupancy) and a lot of lmem (slow). What you should do is try to figure out why you’re using all those registers and lmem. The best/fastest way is to comment out code in your kernel and see which lines take registers/lmem (maybe math things like sin, cos, atan…), and then try to avoid them or do things in a different way.

Your smem usage is very low - maybe you can use it instead of registers…

Also you might be able to break your kernel into multiple kernels to reduce the register pressure.

eyal

Thanks for clarification Eyal…
Another simple and efficient way to decrease register usage, apart from smem usage, is to declare volatile variables and to use float values instead of double ones (that is: don’t forget to put “f” after numbers). Use math functions like __sinf() instead of sin() if you can accept a little decrease in precision.
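To make the “f” suffix point concrete, here is a small sketch in plain C++ (it should compile the same way as host code under nvcc). It only shows how the literal changes the type of an expression; a double occupies two 32-bit registers, but how many registers a real kernel saves depends on the compiler. The function names scale_promoted and scale_single are made up for illustration.

```cpp
#include <type_traits>

// An unsuffixed floating-point literal is a double; the f suffix makes it a float.
static_assert(std::is_same<decltype(1.0), double>::value, "1.0 is a double literal");
static_assert(std::is_same<decltype(1.0f), float>::value, "1.0f is a float literal");

// Mixing a float with a double literal promotes the whole expression to double.
static_assert(std::is_same<decltype(1.0f * 2.0), double>::value, "float * double -> double");
static_assert(std::is_same<decltype(1.0f * 2.0f), float>::value, "float * float -> float");

// Hypothetical helpers: the first does its multiply in double precision,
// the second stays entirely in single precision.
constexpr double scale_promoted(float x) { return x * 2.0; }
constexpr float  scale_single(float x)   { return x * 2.0f; }
```

If a kernel is meant to be all single precision, a stray literal like 2.0 is enough to pull part of the computation into double.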
Cheers,

luca

Thanks for the explanation. But my situation is really an extreme case because this is a BIG algorithm that needs lots of frequently read & updated states.

Everything is “unsigned int” which is 32-bit.

All variables are defined with “unsigned int v00, v01, v02, …” style, and no “unsigned int v[16]” style is used. So no local memory is explicitly declared.

I understand that “less occupancy” can slow down the kernel and local memory is slow. But my algorithm naturally needs lots of variables. It turned out that when I use “too many” variables, nvcc puts 124 of them in registers and all the others in local memory.

Initially I wanted to use 256 registers in a thread. With “Compute Capability 1.3”, there are 16384 registers per multiprocessor. If I use 256 registers, there can still be 16384/256 = 64 threads concurrently running in each multiprocessor.

With “Compute Capability 1.3”, there are 16384 registers and 16KB of shared memory for each multiprocessor. If I use 128 registers per thread, 16384/128 = 128 threads share that 16KB, so there are only 32 32-bit words of shared memory for each thread to use. That is still not enough for my algorithm.
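The arithmetic in the two paragraphs above can be written down as a short C++ sketch. The constants are the compute capability 1.3 figures quoted in this post; the function names are hypothetical, and the calculation deliberately ignores every other limit (block size, allocation granularity, maximum resident threads).

```cpp
// Per-multiprocessor resources for compute capability 1.3, as quoted in the post.
constexpr int kRegistersPerSM   = 16384;
constexpr int kSharedBytesPerSM = 16 * 1024;

// Resident threads allowed by the register file alone (all other limits ignored).
constexpr int threads_by_registers(int regs_per_thread) {
    return kRegistersPerSM / regs_per_thread;
}

// 32-bit shared-memory words left for each of those threads
// if shared memory is split evenly between them.
constexpr int shared_words_per_thread(int regs_per_thread) {
    return kSharedBytesPerSM / 4 / threads_by_registers(regs_per_thread);
}
```

For example, threads_by_registers(256) gives 64, and shared_words_per_thread(128) gives 32, matching the numbers above.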

Any more comments?

It’s really difficult to say what exactly is going on in your kernel without seeing the code.

Our code also uses a complex algorithm. I’ve used --maxrregcount to limit the register usage and it still runs fairly OK.

Smem might indeed save you only a few registers, based on what you describe.

Reducing the thread count per block to 64 might help a bit. You can even experiment with the number of threads and see which gives you the best performance.

Obviously, if you could post the code we might be able to suggest more stuff…

eyal

Can you explain what “volatile variables” means here, and why it is better? I thought the compiler re-uses “dead” variables.

I have the same problem with the number of registers: 124 is not as enormous as some of you consider… Ideally 500 would be necessary for my case: I have a lot of points to describe and to keep “alive”.

One solution would be to use shared memory, but with bank conflicts it will not be as fast (though surely a lot better than lmem).

Does anyone know whether we will have more registers with Fermi?

volatile is good for any intermediate result that you want assigned to a register immediately. Also for constant values that appear several times in following computations.

CUDA often wastes registers by computing the same stuff multiple times. Let’s say you use the following array index several times in some code (for example several times inside a tight loop),

[i*5+y]

CUDA often inlines this computation into the PTX assembly and computes i*5+y multiple times into different target registers. It can be a waste.

volatile int index = i*5+y;

With the above code you would force CUDA to compute it and store it in a register before you enter your computation loop. Then you will use [index] inside the loop. That of course implies that i and y have to be constant within the loop ;)

The following is a good one also. Constants can also be put into a volatile variable, because otherwise CUDA likes to load the same constant over and over into new registers, even if it is the very same constant.

Say you have some code like

foo = 1.0f + sin(x); bar = 1.0f - cos(x);

Instead use this.

volatile float one = 1.0f;

foo = one + sin(x); bar = one - cos(x);

The above saves you one register inside the PTX, which often translates to one saved register in the .cubin as well.

In some cases the tricks I outlined above will cross the threshold to getting a better occupancy on the GPU, especially if it is just a few registers you are short.

Christian

Well, use shared memory. Bank conflicts are not necessarily a severe problem. Or try to re-think your algorithm. Maybe it is possible to split up the computation across several kernel calls, for example. 500 registers/thread is ridiculous - take into account that an SM needs around 200 threads to hide latencies and then imagine the huge amount of register memory the GPU would need to have :)

Fermi will have a ratio of registers/processor like G80.

Yes, 124 is the hard limit. No, it is not possible to use more.

In the Tesla instruction set, register IDs are stored using 7 bits each in the instruction word. Register 124 always contains 0, and registers 125 to 127 probably contain 0 as well or some other constants.

So you can use only the architectural registers R0 to R123.
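As a rough sketch of that counting argument in C++: a 7-bit field can name 128 register IDs, and if the top four IDs are reserved as described above (the post is only certain about 124, and says “probably” for 125 to 127), then 124 general-purpose registers remain. The constant and function names are invented for illustration and are not taken from any NVIDIA header.

```cpp
// Width of the register-ID field in the instruction word, per the post.
constexpr int kRegisterIdBits = 7;

// Number of distinct IDs a 7-bit field can encode: 2^7 = 128.
constexpr int kEncodableIds = 1 << kRegisterIdBits;

// Assumption from the post: IDs 124..127 are reserved (zero or other constants).
constexpr int kFirstReservedId = 124;
constexpr int kReservedIds     = kEncodableIds - kFirstReservedId;

// General-purpose registers are therefore R0..R123.
constexpr int kUsableRegisters = kFirstReservedId;

constexpr bool is_general_purpose(int id) {
    return id >= 0 && id < kFirstReservedId;
}
```

Under these assumptions no compiler flag can go past 124, because a higher register number simply cannot be encoded in the instruction.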

Even if you could run threads with more registers than that, there wouldn’t be much point in doing so. 64 threads per SM is not enough to saturate the arithmetic units, let alone the memory subsystem…

Moderated +5 Insightful. Thank you. ;)

That is the point. Thanks a lot. I will review all my variables and put the infrequently used ones in shared memory and local/global memory.

Fewer registers per core, but more per thread: on Fermi you have 32k regs for 1.5k threads, so about 21 per thread at 100% occupancy. On G80 it was about 10 per thread (8k regs for 768 threads), and on GT200 it was 16 per thread (16k regs for 1024 threads).
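Those per-thread figures are just the register file size divided by the maximum number of resident threads. A minimal C++ sketch, assuming “32k”, “16k” and “8k” mean 32768, 16384 and 8192 and “1.5k” means 1536 (the function name is hypothetical):

```cpp
// Registers each thread can have if the multiprocessor is filled to its
// maximum resident thread count (integer division rounds down).
constexpr int regs_per_thread_at_full_occupancy(int regs_per_sm, int max_threads_per_sm) {
    return regs_per_sm / max_threads_per_sm;
}

// Figures quoted in the post above.
constexpr int kFermi = regs_per_thread_at_full_occupancy(32768, 1536);
constexpr int kG80   = regs_per_thread_at_full_occupancy(8192, 768);
constexpr int kGT200 = regs_per_thread_at_full_occupancy(16384, 1024);
```

With these inputs the three constants come out as 21, 10 and 16.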

And with 124 regs per thread, if you already use that much local memory, maybe using a little more will not slow you down if it increases occupancy? You could use more bandwidth, so it could actually be a speed-up. With 64 regs per thread you could have 8 warps per MP; that is still very low for hiding memory latency, but it is far better than the 4 warps in your case, which can’t even hide instruction latency.
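The warp counts above follow from the 16384-register file of compute capability 1.3 mentioned earlier in the thread and the warp size of 32. A small C++ sketch of that estimate, with a made-up function name, ignoring register allocation granularity and block-size effects:

```cpp
// Threads per warp on NVIDIA GPUs.
constexpr int kWarpSize = 32;

// Whole warps that fit in the register file when every thread
// uses regs_per_thread registers (other occupancy limits ignored).
constexpr int warps_by_registers(int regs_per_sm, int regs_per_thread) {
    return regs_per_sm / (regs_per_thread * kWarpSize);
}
```

For a 16384-register multiprocessor, warps_by_registers(16384, 64) is 8 and warps_by_registers(16384, 124) is 4, so halving the register count roughly doubles the number of warps available to hide latency.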