Register number per thread

Hi, all

I want to implement an 8-bit image convolution. Is there a better way to do it? Right now I use uchar4, but the number of registers per thread is too high to reach good GPU occupancy.

Do you have any advice on reducing the number of registers per thread? The register file is limited, and high register usage lowers GPU occupancy. Right now my kernel uses 30 registers per thread, and I don't understand why the number is so high. Any help reducing it would be appreciated. Thanks very much.
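For reference, the per-kernel register count can be checked at compile time by asking ptxas for its resource-usage report, and a hard cap can be imposed; both are standard nvcc switches (the file name below is only an example):

	nvcc -Xptxas -v convolution.cu          # prints registers and shared memory used by each kernel
	nvcc -maxrregcount=16 convolution.cu    # caps registers per thread; excess values spill to local memory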

This is my code:

template<int i> __device__ float convolutionRow(float x, float y){
	return
		tex2D(texData, x + KERNEL_RADIUS - i, y) * d_Kernel[i]
		+ convolutionRow<i - 1>(x, y);
}
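For the recursion above to terminate, a specialization for the base case is also needed (presumably omitted from the post); a minimal sketch:

	// Terminating specialization: stops the compile-time recursion at i == -1.
	template<> __device__ float convolutionRow<-1>(float x, float y){
		return 0;
	}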

__global__ void convolutionRowGPU(
	uchar4 *d_Result,
	int dataW
){
	const int   ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
	const int   iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
	const float ex = (float)IMUL(ix, 16) + 0.5f;
	const float ey = (float)iy + 0.5f;

	uchar4 sum = make_uchar4(0, 0, 0, 0);
	sum.x = convolutionRow<KERNEL_DIAMETER>(ex,     ey);
	sum.y = convolutionRow<KERNEL_DIAMETER>(ex + 1, ey);
	sum.z = convolutionRow<KERNEL_DIAMETER>(ex + 2, ey);
	sum.w = convolutionRow<KERNEL_DIAMETER>(ex + 3, ey);

	d_Result[IMUL(iy, dataW) + ix] = sum;
}
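For context, the snippets in this thread don't show the texture, kernel, and macro declarations they rely on; they presumably look roughly like the following (the radius value, the read mode, and KERNEL_DIAMETER's definition are all assumptions):

	#define KERNEL_RADIUS   8                     // assumed value
	#define KERNEL_DIAMETER (2 * KERNEL_RADIUS)   // assumed meaning; not shown in the post
	#define IMUL(a, b) __mul24(a, b)              // 24-bit integer multiply, as in the CUDA SDK samples

	// 8-bit image textures; the read mode is a guess
	texture<unsigned char, 2, cudaReadModeNormalizedFloat> texData;
	texture<unsigned char, 2, cudaReadModeNormalizedFloat> RowTexData;

	// Convolution coefficients in constant memory
	__constant__ float d_Kernel[KERNEL_DIAMETER + 1];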

But when I use a for loop, the number of registers per thread drops, maybe because temporaries are reused. However, the performance is lower because the loop is not unrolled.

This is the for-loop code; it uses just 12 registers per thread.

__global__ void convolutionRowGPU(
	uchar4 *d_Result,
	int dataW
){
	const int   ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
	const int   iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
	const float ex = (float)IMUL(ix, 16) + 0.5f;
	const float ey = (float)iy + 0.5f;

	uchar4 sum = make_uchar4(0, 0, 0, 0);
	for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
		sum.x += tex2D(RowTexData, ex + 0 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
	}
	for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
		sum.y += tex2D(RowTexData, ex + 1 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
	}
	for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
		sum.z += tex2D(RowTexData, ex + 2 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
	}
	for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
		sum.w += tex2D(RowTexData, ex + 3 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
	}

	d_Result[IMUL(iy, dataW) + ix] = sum;
}

Is there a way to get the performance of the unrolled version while keeping the register count of the for-loop version?

Thanks very much.
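One thing worth trying (not from this thread): if KERNEL_RADIUS is a compile-time constant, the loop form can be kept and the compiler asked to unroll it with #pragma unroll, for example on the first of the four loops above; whether the register count then stays low depends on the compiler, so this is only a sketch:

	#pragma unroll
	for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
		// Same body as in the for-loop kernel; the pragma asks nvcc to expand the loop at compile time.
		sum.x += tex2D(RowTexData, ex + 0 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
	}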

Some observations:

  • Reusing a variable in your kernel does not necessarily save a register, and in general saving registers is a tricky endeavor because you are fighting the CUDA optimizer. Even if you reuse a variable, you will see in the PTX file produced after compilation that it always, at least in my experience, gets a new register; registers are not reused at the PTX level because that makes it easier for the optimizer to do its job.

  • To really see the register allocation in the executable code, use decuda on the .cubin file; you’ll need the PTX manual handy :blink:

  • Is KERNEL_RADIUS a variable? If it is, make it a compile-time constant that covers your largest convolution and pad the kernel with zeros; see the sketch after this list. Depending on the trip counts, a longer loop with a predictable trip count is often much more efficient than a shorter unpredictable one.
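A rough sketch of that last suggestion, with MAX_KERNEL_RADIUS, d_KernelPadded, h_Kernel, and realRadius all invented for illustration (none of them appear in the thread):

	#define MAX_KERNEL_RADIUS 8   // assumed upper bound covering the largest convolution you need
	__constant__ float d_KernelPadded[2 * MAX_KERNEL_RADIUS + 1];

	// Host side: zero-pad the real coefficients into the fixed-size constant array,
	// so the device loop can always run over -MAX_KERNEL_RADIUS..MAX_KERNEL_RADIUS.
	void uploadKernel(const float *h_Kernel, int realRadius){
		float h_Padded[2 * MAX_KERNEL_RADIUS + 1] = {0};
		for(int i = 0; i <= 2 * realRadius; i++)
			h_Padded[MAX_KERNEL_RADIUS - realRadius + i] = h_Kernel[i];
		cudaMemcpyToSymbol(d_KernelPadded, h_Padded, sizeof(h_Padded));
	}

The device loop then has a fixed, predictable trip count regardless of the actual radius.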

PTX files are not the code that hits your GPU. Register reuse is optimized during compilation from PTX to cubin.

That’s what I said, perhaps not very clearly though.