Register number per thread

Hi, all

I want to implement an 8-bit image convolution. Is there a better way to do it? Right now I use uchar4, but the number of registers per thread is too high to reach good GPU occupancy.

Do you have any advice on reducing the number of registers per thread? The register file is limited, and high register usage lowers GPU occupancy. Right now my kernel uses 30 registers per thread, and I don't understand why the number is so high. Any help reducing it would be appreciated. Thanks very much.
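For reference, the per-kernel register count can be checked at compile time by asking ptxas for its resource-usage report, and a hard cap can be imposed; both are standard nvcc switches (the file name below is only an example):

	nvcc -Xptxas -v convolution.cu          # prints registers and shared memory used by each kernel
	nvcc -maxrregcount=16 convolution.cu    # caps registers per thread; excess values spill to local memory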

This is my code:

template<int i> __device__ float convolutionRow(float x, float y){
	return
		tex2D(texData, x + KERNEL_RADIUS - i, y) * d_Kernel[i]
		+ convolutionRow<i - 1>(x, y);
}
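For the recursion above to terminate, a specialization for the base case is also needed (presumably omitted from the post); a minimal sketch:

	// Terminating specialization: stops the compile-time recursion at i == -1.
	template<> __device__ float convolutionRow<-1>(float x, float y){
		return 0;
	}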

__global__ void convolutionRowGPU(
	uchar4 *d_Result,
	int dataW
){
	const int   ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
	const int   iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
	const float ex = (float)IMUL(ix, 16) + 0.5f;
	const float ey = (float)iy + 0.5f;

	uchar4 sum = make_uchar4(0, 0, 0, 0);
	sum.x = convolutionRow<KERNEL_DIAMETER>(ex,     ey);
	sum.y = convolutionRow<KERNEL_DIAMETER>(ex + 1, ey);
	sum.z = convolutionRow<KERNEL_DIAMETER>(ex + 2, ey);
	sum.w = convolutionRow<KERNEL_DIAMETER>(ex + 3, ey);

	d_Result[IMUL(iy, dataW) + ix] = sum;
}
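For context, the snippets in this thread don't show the texture, kernel, and macro declarations they rely on; they presumably look roughly like the following (the radius value, the read mode, and KERNEL_DIAMETER's definition are all assumptions):

	#define KERNEL_RADIUS   8                     // assumed value
	#define KERNEL_DIAMETER (2 * KERNEL_RADIUS)   // assumed meaning; not shown in the post
	#define IMUL(a, b) __mul24(a, b)              // 24-bit integer multiply, as in the CUDA SDK samples

	// 8-bit image textures; the read mode is a guess
	texture<unsigned char, 2, cudaReadModeNormalizedFloat> texData;
	texture<unsigned char, 2, cudaReadModeNormalizedFloat> RowTexData;

	// Convolution coefficients in constant memory
	__constant__ float d_Kernel[KERNEL_DIAMETER + 1];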

But when I use a for loop, the number of registers per thread drops, maybe because temporaries are reused. However, the performance is lower because the loop is not unrolled.

This is the for-loop code; it uses just 12 registers per thread.

__global__ void convolutionRowGPU(
	uchar4 *d_Result,
	int dataW
){
	const int   ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
	const int   iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
	const float ex = (float)IMUL(ix, 16) + 0.5f;
	const float ey = (float)iy + 0.5f;

	uchar4 sum = make_uchar4(0, 0, 0, 0);
	for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
		sum.x += tex2D(RowTexData, ex + 0 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
	}
	for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
		sum.y += tex2D(RowTexData, ex + 1 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
	}
	for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
		sum.z += tex2D(RowTexData, ex + 2 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
	}
	for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
		sum.w += tex2D(RowTexData, ex + 3 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
	}

	d_Result[IMUL(iy, dataW) + ix] = sum;
}

Is there a way to get the performance of the unrolled version while keeping the register count of the for-loop version?

Thanks very much.
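One thing worth trying (not from this thread): if KERNEL_RADIUS is a compile-time constant, the loop form can be kept and the compiler asked to unroll it with #pragma unroll, for example on the first of the four loops above; whether the register count then stays low depends on the compiler, so this is only a sketch:

	#pragma unroll
	for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
		// Same body as in the for-loop kernel; the pragma asks nvcc to expand the loop at compile time.
		sum.x += tex2D(RowTexData, ex + 0 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
	}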

Some observations:

  • Reusing a variable in your kernel does not necessarily save a register, and in general saving registers is a tricky endeavor because you are fighting the CUDA optimizer. Even if you reuse a variable, you will see in the PTX file produced after compilation that it always, at least in my experience, gets a new register; registers are not reused at the PTX level because that makes it easier for the optimizer to do its job.

  • To really see the register allocation in the executable code, use decuda on the .cubin file; you’ll need the PTX manual handy :blink:

  • Is KERNEL_RADIUS a variable? If it is, make it a compile-time constant that covers your largest convolution and pad the kernel with zeros; see the sketch after this list. Depending on the trip counts, a longer loop with a predictable trip count is often much more efficient than a shorter unpredictable one.
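A rough sketch of that last suggestion, with MAX_KERNEL_RADIUS, d_KernelPadded, h_Kernel, and realRadius all invented for illustration (none of them appear in the thread):

	#define MAX_KERNEL_RADIUS 8   // assumed upper bound covering the largest convolution you need
	__constant__ float d_KernelPadded[2 * MAX_KERNEL_RADIUS + 1];

	// Host side: zero-pad the real coefficients into the fixed-size constant array,
	// so the device loop can always run over -MAX_KERNEL_RADIUS..MAX_KERNEL_RADIUS.
	void uploadKernel(const float *h_Kernel, int realRadius){
		float h_Padded[2 * MAX_KERNEL_RADIUS + 1] = {0};
		for(int i = 0; i <= 2 * realRadius; i++)
			h_Padded[MAX_KERNEL_RADIUS - realRadius + i] = h_Kernel[i];
		cudaMemcpyToSymbol(d_KernelPadded, h_Padded, sizeof(h_Padded));
	}

The device loop then has a fixed, predictable trip count regardless of the actual radius.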

PTX files are not the code that hits your GPU. Register reuse is optimized during compilation from PTX to cubin.

That’s what I said, perhaps not very clearly though.