Getting nvcc to consolidate registers

I’ve been trying to optimize some CUDA code and I’ve noticed that nvcc (v2.1) seems to allocate new registers for new temporary calculations rather than reusing previously-allocated ones. This issue seems to propagate all the way to the final binary code based on what I see in the CUDA profiler register allocation output.

Here’s some simplified code that illustrates the issue:

__global__

void foo(float *o) {

  __shared__ float a[16];

  int const x = floorf(0.5f);

  int const y = ceilf(0.5f);

  a[threadIdx.x+x] = threadIdx.y; // threadIdx.x+x --> %rd7

  a[threadIdx.x+y] = threadIdx.z; // threadIdx.x+y --> %rd10

  o[threadIdx.x] = a[threadIdx.x];

}

If I compile it with “nvcc -c t.cu -keep -O3 -o t.o” and then examine the PTX output, the two commented lines from above become:

cvt.u64.u32 	%rd5, %r7;	   	// 

	mul.lo.u64 	%rd6, %rd5, 4;		// 

	add.u64 	%rd7, %rd1, %rd6;		// 

	st.shared.f32 	[%rd7+0], %f4; 	// id:26 __cuda_a8+0x0

	.loc	14	8	0

	cvt.u64.u32 	%rd8, %r1;	   	// 

	mul.lo.u64 	%rd9, %rd8, 4;		// 

	add.u64 	%rd10, %rd9, %rd1;   	// 

	ld.shared.f32 	%f7, [%rd10+0];	// id:27 __cuda_a8+0x0

%rd7 holds the index for the first shared memory store. I would have expected the compiler to re-use %rd7 to hold the index for the second shared memory store, but instead it allocates a new register, %rd10, for that task.

The only solution that I’ve found so far is to set --maxrregcount manually for the whole compilation unit, which has some obvious downsides.

Is there any way to get nvcc to automatically consolidate registers so that new ones are not allocated when there are already existing ones that can be reused (because they will never be referenced again)?

PTX is an intermediate language, not the final assembly output. Use decuda to verify your assumption.

Consensus here, so far, has been that register reuse is done in the final stage of translating the PTX code to native machine instructions.

However I have often been able to reduce register usage at the PTX level by carefully making selected local variables “volatile”- it effects compiler optimization such that the compiler puts the value into a register immediately. I even do this for constants (e.g. 1.0 or 0.0) that are needed more than once. This saves registers because constants usually keep getting loaded into registers over and over - even if the same constant has been loaded previously. The volatile trick is a nice workaround - however I have only tested it with the 1.1 and 2.0 SDK so far.

Christian

nvcc outputs PTX using static-single assignment (see Wikipedia), which is a common intermediate form to assist in optimization and register assignment (both of which are done in ptxas).

Thanks for the extra info and suggestions, chbhner1 and seibert.

I have been trying to optimize my code as well by reducing my register overhead. Though I have not examined the assembly, by studying the ptxas info, it appears that all the registers are not reused and new ones are claimed. This is most obvious when I call a function twice back to back, the second call increases the number of registers even though any temporary variables needed by the function should have been acquired and released by the first call.

If this is not in the FAQ, it should be. ptx allocates a new register for every new variable it encounters. In the compilation phase from ptx to device code ptxas optimizes this.

To clarify, if you want to see actual register usage of your code, you should be looking at the .cubin file, not the .ptx file. The cubin is the final binary that gets loaded into the GPU, and also contains the final register usage count, as well as shared mem and local mem usage.

brilliant idea !

using this trick I decreased register usage from 36 to 29,

so that having 128 threads per block I can run 4 blocks per SM now, thanks…

Could you guys give an example how you decrease the register usage with “volatile” keyword? I gave it a try, but the only register usage reduction I achieved resulted in usage of local memory.

It works best for intermediate variables, look counters, indexes etc. Make these volatile. Examples follow.

Don’t make arrays volatile.

// bogus array access example

// unoptimized version

// often the compiler "inlines" the computation needlessly, leading to longer PTX and extra register use.

int tmp = blockIdx.x + 7;

g_array1[tmp] = x;   // inlined as g_array1[blockIdx.x + 7] = x;

g_array2[tmp] = y;   // inlined as g_array1[blockIdx.x + 7] = y;

// optimized version: tmp is stored in a register and used in both array accesses

// look at PTX to see why this is better.

volatile int tmp = blockIdx.x + 7;

g_array1[tmp] = x;

g_array2[tmp] = y;

// bogus computation example 

// unoptimized version of some bogus computation

x = cos ( arg - 1.0 );

y = sin ( arg - 1.0 );

// optimized version: advantage becomes apparent when you look at the PTX

volatile int one = 1.0f; // declare such constants early on, re-use often (where applicable)

x = cos ( arg - one );

y = sin ( arg - one );

// unoptimized initialization example:

// PTX may allocate up to 3 extra registers (first load constant to register, then assign constant to target location)

var1 = 0.0f;

var2 = 0.0f;

var3 = 0.0f;

// optimized initialization example

// better register use: PTX allocates 1 extra register for constant 0

volatile int zero = 0.0f;

var1 = zero;

var2 = zero;

var3 = zero;

The PTX allocates less registers and the final optimization of register use that happens during translation from PTX to machine code may add some extra efficiency. Without the volatile trick the PTX allocates more registers initially and the “peephole” optimizer (or whatever algorithm is used) does not achieve the same efficiency as if we had helped manually.

Thanks a lot. I tried it, and it worked wonderfully.

Summarizing all of the above examples in one sentence:

For variables with local scope (that are not forced to local memory like arrays would) the volatile keyword is essentially

the missing “__register” keyword.

This thread was tremendously useful, it works great!

Could you give me an official reference to include it in my work, because I couldn’t find anything helpfull about the volatile keyword in the programming guide, the only thing I found was:

“Only after the execution of a __syncthreads() (Section 4.4.2) are writes to shared variables guaranteed to be visible by other threads. Unless the variable is declared as volatile, the compiler is free to optimize the reads and writes to shared memory as long as the previous statement is met.”

and in the ptx_isa_1.2 there is a reference to this reserved keyword and I quote:

"st.volatile may be used with .global and .shared spaces to inhibit optimization of

references to volatile memory. This may be used, for example, to enforce sequential

consistency between threads accessing shared memory."

But neither of them clear the things out for me,

thanks a lot!

Sorry, there is no official reference on use of “volatile” keyword for reducing register pressure. All I know about this is from experience (looking at PTX output before and after inserting volatile). So far it has worked in all SDK versions, and it probably will continue to work until nVidia significantly improve their compiler.

Thanks so much for sharing this trick, it really helps, dropped my troublesome kernel from 20 to 15 registers at the cost of quite a bit of speed for some reason.

Looking at the PTX code, (can’t get decuda to work with G10), I am amazed at how many CVT and MOV commands are wasting registers. I’ve tried using unsigned shorts to prevent the CVT commands, this results in an increase in the %rh counts ( which I assume is a half-register) and switched from unsigned to signed ints to try to prevent MOV commands that are only moving an unsigned int to a signed int register. I know that optimization takes place after PTX, but without being able to look at my cubin with decuda PTX is all I have at the moment.

I guess I could work on the decuda code to get it working but I’m not a python programmer so that would take a while. Any tricks that you’ve learned to cut down on some of the CVT and MOV commands?

  • Richard

[Conclusion] Volatile trick is actually NOT good, I think.

[My assurance] Volatiled register variable might behave as a volatiled local variable.

  • By using volatile trick, it’s sure that register usage get reduced, but this is just because local memory is used instead of register.

  • Volatile trick goes the application slower. I confirmed by an experiment.

[Evidence of my assurance 1] CUDA Programming guide 4.2 D.2.1.2 (p126)
It says,

“if a variable located in global or shared memory is declared as volatile, the compiler assumes that its value can be changed or used at any time by another thread and therefore any reference to this variable compiles to an actual memory read or write instruction.”

So the behavior of volatiled register is not defined officially. But it’s true that volatile trick changes the application’s performance, which is lower, and register usage, which is also lower.
But, it can be supposed that register variables compile to an actual memory read that is local memory.
Therefore local memory is corresponding to register.

(Please note that EVEN IF the volatile trick makes the occupancy higher, the performance will goes down, in my experience. )

[Evidence of my assurance 2]
I made 3 programs which measure the execution time of some instruction by clock().

One uses a volatiled register valiable.
global kernel(){
volatile double a = 2.0;

}

Another uses a local memory variable.
global kernel(){
double a[2];
a[0] = 2.0;

}

The other uses a volatiled local memory variable.
global kernel(){
volatile double a[2];
a[0] = 2.0;

}

All of these 3 programs made the same result (output the same clock cycles).
And the first code uses 4, second uses 2, third uses 4 registers in the application.

It shows that volatiled register variables is equal to volatiled local memory, I think.

Use of “volatile” can easily force the object thus marked into local memory. You save a register, and in exchange access to the variable becomes slow since it needs to be loaded from memory.

The generated PTX code uses many registers because the frontends generate SSA-style code, where a new register is used for each new result written. See:

http://en.wikipedia.org/wiki/Static_single_assignment_form

The registers used by PTX are virtual registers and have nothing to do with the actual hardware registers being used by the generated machine code (SASS). PTXAS translates PTX to SASS and performs register allocation and instruction scheduling, among other things.

Unnecessary MOV and CVT instructions you see at PTX level should be optimized away by PTXAS. You can check by disassembling the SASS code with cuobjdump --dump-sass.

I will add to what the two posts above advise with my own personal experience.

Please see post https://devtalk.nvidia.com/default/topic/522728/nsight-visual-studio-edition/-34-local-34-memory-statistics/ for the whole story!

Ailleur

I’m glad that our posts help your work. I saw all of your story.

Actually, I couldn’t assert that volatiled register is equal to the volatiled “local” memory because if I use shared memory instead of local memory, the application result was the same.

But when you stop using volatile trick, local memory usage becomes zero, so I could make sure just now.

Thanks.

njuffa

And thank you, too. Actually I’m not familiar with PTX code, but I could improve my image of PTX by your post.

Thanks a lot.

Wooha! Someone revived a zombie thread from 2009. I can only recommend using the volatile trick up to CUDA SDK 2.3 - and as a last resort to ease register pressure.

When migrating your code to the later SDKs (in particular those with the LLVM based compiler), definitely throw that keyword out ;)

Christian