How to reduce register usage

Hi,

I have been trying to optimise my kernels, and it’s proving very difficult.

I have coalesced memory accesses, and halved the number of texture cache misses. However, I always end up using slightly too many registers, which brings the occupancy way down, and so I only get a minor speedup.

For example, here is one of my kernels below. Previously it used 16 registers and I could get 100% occupancy. Then I changed the source texture data structure slightly; now the current version uses 17 registers and I can only get 1/3 occupancy, so I’ve only gotten a small speedup.

It seems like whatever optimisations I do, I end up using more registers, which cancels out any benefit (or even makes the kernel slower). I can’t find a strategy for decreasing the register usage. I’ve tried fiddling with the order of calculations, avoiding named variables for intermediate results, using loops, and unrolling loops. None of them seems to have a predictable effect; I have literally spent hours trying different things, and I still don’t have a good model of how the compiler assigns registers. Do you guys have any suggestions? Or maybe I am expecting too much, and 100% occupancy is too difficult?

/* The block size is 32x8 */

KERNEL void ahd_kernel_interp_g(pixel4* g_horz_res, pixel4* g_vert_res, int width, int height)
{
	uint x = blockIdx.x*blockDim.x + threadIdx.x;
	uint y = blockIdx.y*blockDim.y + threadIdx.y;

	if (x < 2 || y < 2 || x >= width-2 || y >= height-2) {
		return;
	}

	int filter_color = get_filter_color(x,y);
	int mulB = filter_color == B;
	int mulR = filter_color == R;
	int mulG = filter_color == G;

	char4 h_res, v_res;

	/* Copy existing value to output */
	h_res.x = v_res.x = mulB * tex2D(src,x,y);
	h_res.y = v_res.y = mulG * tex2D(src,x,y);
	h_res.z = v_res.x = mulR * tex2D(src,x,y);

	/* Interpolate Green values first */
	if (filter_color == R || filter_color == B) {
		/* Filter color is red or blue: interpolate green channel horizontally */
		/* Use existing green values */
		float sum = (tex2D(src,x-1,y) +
					 tex2D(src,x+1,y))/2.0f;

		/* And use existing red/blue values and apply filter 'h' */
		sum += (-tex2D(src,x-2,y)/4.0f +
				 tex2D(src,x,  y)/2.0f +
				-tex2D(src,x+2,y)/4.0f)/4.0f;

		h_res.y = (uchar)clampc(sum);

		/* Interpolate green channel vertically */
		/* Use existing green values */
		sum = (tex2D(src,x,y-1) +
			   tex2D(src,x,y+1))/2.0f;

		/* And use existing red/blue values and apply filter 'h' */
		sum += (-tex2D(src,x,y-2)/4.0f +
				 tex2D(src,x,y  )/2.0f +
				-tex2D(src,x,y+2)/4.0f)/4.0f;

		v_res.y = (uchar)clampc(sum);
	}

	int res_index = (y*width + x);

	g_horz_res[res_index] = h_res;
	g_vert_res[res_index] = v_res;
}

try this

volatile uint x = blockIdx.x*blockDim.x + threadIdx.x;
volatile uint y = blockIdx.y*blockDim.y + threadIdx.y;

and

volatile int res_index = (y*width + x);

I made a forum posting about fighting register pressure. Maybe that can be helpful to you.

http://www.lmgtfy.com/?q=cuda+tricks+to+fi…gister+pressure

would this work better, maybe? I tried manually untangling the dependencies

(however the compiler should be smart about it already, unless one abuses the volatile keyword)

volatile int filter_color = get_filter_color(x,y);
int mulB = filter_color == B;
h_res.x = v_res.x = mulB * tex2D(src,x,y);
int mulG = filter_color == G;
h_res.y = v_res.y = mulG * tex2D(src,x,y);
int mulR = filter_color == R;
h_res.z = v_res.x = mulR * tex2D(src,x,y);

By the way, is the v_res.x in the last line a typo? I would expect v_res.z here.

Here is a nice trick that I have almost never seen described anywhere:

volatile float two = 2.0f;

volatile float four = 4.0f;

Then replace occurrences of 2.0f and 4.0f in the following code with two and four respectively.

You will need to look at the PTX code to really understand what’s going on.
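Applied to the horizontal interpolation from the first post, the substitution would look roughly like this (a sketch only; the volatile declarations must appear inside the kernel before first use):

```cuda
/* Sketch: force the constants into registers once, instead of letting
   the compiler re-materialize the immediates wherever they are used.
   The volatile qualifier stops constant folding/propagation. */
volatile float two  = 2.0f;
volatile float four = 4.0f;

float sum = (tex2D(src,x-1,y) +
             tex2D(src,x+1,y))/two;

sum += (-tex2D(src,x-2,y)/four +
         tex2D(src,x,  y)/two +
        -tex2D(src,x+2,y)/four)/four;
```

Whether this helps or hurts depends on how ptxas was going to schedule those immediates anyway, so it is worth checking the register count before and after.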

Hi cbuchner,

Thanks for the advice, I will try it with volatile.

I was able to get rid of 1 register with the following change.

int filter_color = get_filter_color(x,y);
//	int mulB = filter_color == B;
//	int mulR = filter_color == R;
//	int mulG = filter_color == G;

char4 h_res, v_res;

/* Copy existing value to output */
h_res.x = v_res.x = (filter_color == R) * tex2D(src,x,y);
h_res.y = v_res.y = (filter_color == G) * tex2D(src,x,y);
h_res.z = v_res.z = (filter_color == B) * tex2D(src,x,y);

Hmm, the order used to be B, G, R … not R, G, B!

one final comment…

Wouldn’t *0.25f and *0.5f be substantially faster than a floating point division by 4.0f and 2.0f respectively?
But if this kernel is bounded by memory throughput, one wouldn’t notice a difference anyway.
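Since 0.5f and 0.25f are exact powers of two, multiplying by the reciprocal gives bit-identical results to the division here, so the substitution is safe. Against the horizontal interpolation from the first post it would look like this (a sketch only):

```cuda
/* Sketch: multiply by the reciprocal instead of dividing.
   0.5f and 0.25f are exactly representable, so for these
   power-of-two divisors the numerical result is unchanged. */
float sum = (tex2D(src,x-1,y) +
             tex2D(src,x+1,y)) * 0.5f;

sum += (-tex2D(src,x-2,y) * 0.25f +
         tex2D(src,x,  y) * 0.5f +
        -tex2D(src,x+2,y) * 0.25f) * 0.25f;
```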

Christian

I have a kernel that uses 60 registers, but when I count them myself I end up with something like 30. Is it possible to see exactly how the compiler counts the registers?

I tried some volatile declarations and “float zero = 0.0f” but it does not make any difference. I need to get down to 32 registers to run 512 threads.

Even in the PTX code you can’t see the final register count. What you see there is the result of a static single assignment strategy (i.e. each register gets assigned only once). Thousands of registers are very common in the PTX code.

Internally the PTX assembler (PTXAS) builds itself a huge data structure that holds all the dependencies and lifetimes of the various registers. Only then is the final register count determined, by running an optimizer that tries to make efficient re-use of actual hardware registers. The final register count is the maximum number of concurrently required registers at any given point in the algorithm’s control flow.

However, some of the previously stated tricks may make it easier for this optimizer to be effective - for example if you actively limit the scope and lifetime of your variables. Or by reducing redundant intermediate computations (volatile index variables help a lot here). Or by removing redundant loading of constants into registers, etc. Some of the tricks already have a significant reducing effect on the total register count in the PTX file. Given this reduced-complexity “optimization problem”, the optimizer in PTXAS can do a better job.
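Limiting scope and lifetime can be as simple as wrapping temporaries in their own block so their last use is visibly bounded. A generic sketch (the kernel and names here are hypothetical, not taken from the code above):

```cuda
/* Hypothetical example: restricting a temporary's scope so the
   allocator can reuse its register after the closing brace. */
__global__ void scope_example(float* out, const float* in)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	float result;
	{
		/* tmp is only live inside this block */
		float tmp = in[i] * 2.0f;
		result = tmp + 1.0f;
	}	/* tmp is dead here; its register is free for reuse below */
	out[i] = result * result;
}
```

No guarantee ptxas needs the hint, but it shrinks the lifetimes it has to juggle.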

To get down from 60 to 32 sounds pretty difficult to me. Can you post kernel code, or is this stuff proprietary?

I wish they open-sourced the optimizer part of PTXAS; I would be OK with the code generator staying closed (e.g. shipping as a .a file to link against).

Christian

In general I think no method exists that is both reliable and sane. You need to stick to -maxrregcount until NVIDIA fixes this and/or gives us more control over instruction scheduling and register allocation.

Just for the fun of messing with ptxas, look at the kernel (results for toolkit 2.3):

[codebox]__global__ void schedsucks(uint32_t* result){
#define N 30
	uint32_t A[N];
#pragma unroll
	for(int i=1;i<N;i++){
		A[i]=result[i];
	}
#pragma unroll
	for(int i=1;i<N;i++){
		A[i]^=i;
	}
#pragma unroll
	for(int i=1;i<N;i++){
		result[i]=A[i];
	}
}[/codebox]

That one compiles to:

ptxas info : Used 58 registers, 16+16 bytes smem, 4 bytes cmem[0], 116 bytes cmem[1]

And register usage scales with N.

But if you merge the loops:

[codebox]
__global__ void schedsucks(uint32_t* result){
#define N 30
	uint32_t A[N];
#pragma unroll
	for(int i=1;i<N;i++){
		A[i]=result[i];
		A[i]^=i;
		result[i]=A[i];
	}
}
[/codebox]

then suddenly it compiles to:

ptxas info : Used 2 registers, 16+16 bytes smem, 4 bytes cmem[0], 116 bytes cmem[1]

You can do this with Ocelot. It comes with a PTXOptimizer that can do register allocation on PTX and show you the locations of the spills/fills. It might not be exactly the same thing as ptxas, but it should at least show you where the register pressure in your application is.

To use this, you need to

  1. install Ocelot

  2. generate a ptx assembly file for your kernel. nvcc --ptx -arch sm_13 your_kernel.cu

  3. Run the PTXOptimizer on it: PTXOptimizer -i=your_kernel.ptx -a=linearscan -o=output_file_name.ptx -r=count

  4. Examine the resulting PTX assembly to find the spills/fills. You can set the max number of hardware registers (-r) to something very high to see the register requirements of the kernel.

For example:

my_kernel.cu

__global__ void test_kernel(float* output, float* input, unsigned int index)
{
	if(index < 128)
	{
		output[index] = input[index];
	}
}

my_kernel.ptx

.version 2.0
	.target sm_20
...
	.entry _Z11test_kernelPfS_j (
		.param .u32 __cudaparm__Z11test_kernelPfS_j_output,
		.param .u32 __cudaparm__Z11test_kernelPfS_j_input,
		.param .u32 __cudaparm__Z11test_kernelPfS_j_index)
	{
	.reg .u32 %r<9>;
	.reg .f32 %f<3>;
	.reg .pred %p<3>;
	.loc	27	1	0
$LBB1__Z11test_kernelPfS_j:
	ld.param.u32 	%r1, [__cudaparm__Z11test_kernelPfS_j_index];
	mov.u32 	%r2, 127;
	setp.gt.u32 	%p1, %r1, %r2;
	@%p1 bra 	$Lt_0_1026;
	.loc	27	5	0
	mul.lo.u32 	%r3, %r1, 4;
	ld.param.u32 	%r4, [__cudaparm__Z11test_kernelPfS_j_input];
	add.u32 	%r5, %r4, %r3;
	ld.global.f32 	%f1, [%r5+0];
	ld.param.u32 	%r6, [__cudaparm__Z11test_kernelPfS_j_output];
	add.u32 	%r7, %r6, %r3;
	st.global.f32 	[%r7+0], %f1;
$Lt_0_1026:
	.loc	27	8	0
	exit;
$LDWend__Z11test_kernelPfS_j:
	} // _Z11test_kernelPfS_j

optimized.ptx (infinite-regs)

.version 1.4
.target sm_13
/* Module register-example.ptx */
/* Globals */
/* Textures */
/*
* Ocelot Version : 1.1.560
*/
.entry _Z11test_kernelPfS_j(.param  .u32 __cudaparm__Z11test_kernelPfS_j_output,
		.param  .u32 __cudaparm__Z11test_kernelPfS_j_input,
		.param  .u32 __cudaparm__Z11test_kernelPfS_j_index)
{
	.reg .u32 %r0;
	.reg .u32 %r1;
	.reg .pred %p2;
	.reg .u32 %r4;
	.reg .u32 %r5;
	.reg .u32 %r6;
	.reg .f32 %r7;
	.reg .u32 %r8;
	.reg .u32 %r9;
	$BB_1_1:				/* $LBB1__Z11test_kernelPfS_j */
		ld.param.u32 %r0, [__cudaparm__Z11test_kernelPfS_j_index];
		mov.u32 %r1, 127;
		setp.gt.u32 %p2, %r0, %r1;
		@%p2 bra $BB_1_3;
	$BB_1_2:
		mul.lo.u32 %r4, %r0, 4;
		ld.param.u32 %r5, [__cudaparm__Z11test_kernelPfS_j_input];
		add.u32 %r6, %r5, %r4;
		ld.global.f32 %r7, [%r6 + 0];
		ld.param.u32 %r8, [__cudaparm__Z11test_kernelPfS_j_output];
		add.u32 %r9, %r8, %r4;
		st.global.f32 [%r9 + 0], %r7;
	$BB_1_3:				/* $Lt_0_1026 */
		exit;
}

optimized.ptx (5 registers total)

.version 1.4
.target sm_13
/* Module register-example.ptx */
/* Globals */
/* Textures */
/*
* Ocelot Version : 1.1.560
*/
.entry _Z11test_kernelPfS_j(.param  .u32 __cudaparm__Z11test_kernelPfS_j_output,
		.param  .u32 __cudaparm__Z11test_kernelPfS_j_input,
		.param  .u32 __cudaparm__Z11test_kernelPfS_j_index)
{
	.local .u8 _Zocelot_linear_scan_register_allocation_stack[4];
	.reg .u32 %r0;
	.reg .u32 %r1;
	.reg .pred %p2;
	.reg .u32 %r4;
	.reg .u32 %r5;
	.reg .u32 %r6;
	.reg .f32 %r16;
	.reg .u32 %r8;
	.reg .u32 %r9;
	.reg .f32 %r19;
	$BB_1_1:				/* $LBB1__Z11test_kernelPfS_j */
		ld.param.u32 %r0, [__cudaparm__Z11test_kernelPfS_j_index];
		mov.u32 %r1, 127;
		setp.gt.u32 %p2, %r0, %r1;
		@%p2 bra $BB_1_3;
	$BB_1_2:
		mul.lo.u32 %r4, %r0, 4;
		ld.param.u32 %r5, [__cudaparm__Z11test_kernelPfS_j_input];
		add.u32 %r6, %r5, %r4;
		ld.global.f32 %r16, [%r6 + 0];
		st.local.f32 [_Zocelot_linear_scan_register_allocation_stack], %r16;
		ld.param.u32 %r8, [__cudaparm__Z11test_kernelPfS_j_output];
		add.u32 %r9, %r8, %r4;
		ld.local.f32 %r19, [_Zocelot_linear_scan_register_allocation_stack];
		st.global.f32 [%r9 + 0], %r19;
	$BB_1_3:				/* $Lt_0_1026 */
		exit;
}

You can use the CFG or DFG tools that come with Ocelot to visualize it as in: http://www.gdiamos.net/files/_Z11test_kern…S_j_cfg.dot.pdf

I recompiled the same code with the compiler options “--keep -g -G”; the register usage went down from 60 to 32 registers, but the constant memory usage increased. Compiling the code like this took 10 minutes instead of 2 seconds!

I would imagine that this kernel will also be significantly slower if you try to run it. Debug mode disables optimization to make sure that all variables in CUDA are visible to the debugger and are computed in the same order that they are computed in the source code.

Haha! “schedsucks” is a great name for this test case.

To me it is clear where the need for 29 registers comes from. It has to keep all the values of A[1] to A[29] in registers - you’re lucky it did not move the entire array to local memory. So far I thought large arrays always go to local memory.

I have no clue why it would use twice the number of registers that I think it should. So 58 is definitely not cool.

Christian