What, other than variables, consumes registers? Help me understand where my register usage is going

Can you explain what constructs consume registers in a kernel?

The reason I ask is that I just did a lot of #define's, making my code recalculate a whole bunch of values each time they are needed instead of keeping them in variables, thinking it would halve register usage. But I was very disappointed to discover it had NIL effect on register usage!

And then I tried some code like this, which reported 5 registers used:

__global__ void test_kernel(int* d_data)
{
    int a = 0;
    int b = 1;
    d_data[0] = a + b;
}

But then this one also reports exactly 5 registers used:

__global__ void test_kernel(int* d_data)
{
//  int a = 0;
//  int b = 1;
    d_data[0] = 1;
}

I don’t understand why they would be the same!

Thanks in advance,

Mike

ptxas does a lot of optimization for you. a and b never exist in registers; they are folded directly into the code as constants. So the first and second kernels compile to equivalent code.
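You can see this for yourself by dumping the PTX and the register report (assuming the source file is named test_kernel.cu; any file name works):

# Emit PTX to see that a and b have been folded away
nvcc -ptx test_kernel.cu -o test_kernel.ptx

# Print the per-kernel register count that ptxas settles on
nvcc -c --ptxas-options=-v test_kernel.cu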

Thanks hyqneuron.
I knew the compiler would optimise out and/or reuse registers where it can, so this was a pretty useless example, I guess.

But in my situation, where my kernel manually converts between 3D and linear array coordinates, replacing ALL of the variables associated with that indexing using #define's has made zero difference, so it must be that those variables weren't consuming registers anyway? (A simplified sketch of what I mean is below.)
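Roughly the pattern I mean, cut down to 2D and not my actual kernel:

// Version 1: named index variables
__global__ void add_one_vars(float* data, int nx, int ny)
{
    int ix  = blockIdx.x * blockDim.x + threadIdx.x;
    int iy  = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = iy * nx + ix;          // 2D -> linear index
    data[idx] += 1.0f;
}

// Version 2: the same kernel with the index inlined via a #define
#define IDX ((blockIdx.y * blockDim.y + threadIdx.y) * nx + \
             blockIdx.x * blockDim.x + threadIdx.x)

__global__ void add_one_define(float* data, int nx, int ny)
{
    data[IDX] += 1.0f;
}

This is the kind of change that made zero difference to the reported register count.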

Does that mean there is little chance of being able to reduce register usage further?

Built-in functions that you call can consume registers, but there is little you can do about that (unless the reduced precision of the __ version of the intrinsics is good enough for your purpose).
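For example (a made-up kernel just to illustrate; compare the two variants with --ptxas-options=-v):

__global__ void intrinsic_demo(float* out, const float* in)
{
    float x = in[threadIdx.x];

    // Full-precision sinf() is implemented in software and needs temporaries
    // for argument reduction, which can cost registers.
    float slow = sinf(x);

    // __sinf() maps to the special function unit: lower accuracy, but it may
    // need fewer registers.
    float fast = __sinf(x);

    out[threadIdx.x] = slow + fast;
}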

Another way to think about this problem is that registers are not for holding variables, they are for holding operands and values of expressions and subexpressions (including logical ones). Complex expressions will need more registers than simple ones. Using #define tricks to remove variables can’t help because all you are doing is inlining the expressions that you previously assigned to a named variable. The same instruction sequence has to be executed either way, and each of those instructions needs to pull operands from some registers and write a result to another register.
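To make that concrete, here is a toy pair of kernels (not your code, just an illustration) that should report identical register counts, because the same multiply and add have to execute either way:

// Named temporary
__global__ void with_variable(float* out, const float* a, const float* b)
{
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    float t = a[i] * b[i] + 1.0f;   // t occupies a register either way
    out[i]  = t;
}

// Same expression inlined with a #define -- the intermediate value
// still has to live in a register between the multiply-add and the store.
#define T (a[i] * b[i] + 1.0f)

__global__ void with_define(float* out, const float* a, const float* b)
{
    int i  = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = T;
}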

Reducing register usage in a kernel is very hard and generally unstable to small source changes. You can use the --maxrregcount parameter to nvcc to force the PTX assembler to reduce register usage by spilling intermediate values to local memory. There is, of course, a speed penalty for local memory, but it is worth experimenting with.
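For example (the value 32 and the file name are just placeholders):

# Cap each thread at 32 registers; values that no longer fit are spilled
# to local memory. -Xptxas -v prints the resulting per-kernel register count.
nvcc -Xptxas -v --maxrregcount=32 -c mykernel.cu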


Two thoughts, don't know if they will help or not:

Microbenchmarking shows that the number of threads per block changes the number of registers a thread can have
(it ranges from 32 to 128 registers per thread).
So reducing the number of threads per block allows more registers per thread, which may make things run faster. (Of course, if you are already using fewer registers than are available, then why worry :)

See Fig. 9 (page 6) of the paper referred to here:
http://forums.nvidia.com/index.php?showtopic=200014&st=0&p=1235488&hl=microbenchmarking&fromsearch=1&#entry1235488

Also, something I saw on reducing register pressure, only if you really have to:
http://forums.nvidia.com/index.php?showtopic=168974


The number of registers used by a thread is determined by ptxas at compile time. For Fermi, the maximum is 63 (R0 to R62).
