Reduce register count per thread

Hi all,

Do you have any advice on reducing the number of registers per thread? As you know, the register file is limited, and register usage affects GPU occupancy. Right now my kernel uses 30 registers per thread, and I don't understand why the number is so high. Can you give me some help to reduce this? Thanks very much.
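(For what it's worth, I read the per-thread register count from the ptxas output; nvcc prints it at compile time if you pass the verbose flag. `convolution.cu` below is just a placeholder for the actual source file name:)

```shell
# Ask ptxas to report per-kernel resource usage (registers, smem, cmem):
nvcc --ptxas-options=-v -c convolution.cu

# There is also a hard cap you can impose, at the cost of possible
# spills to local memory:
#   nvcc --maxrregcount=16 -c convolution.cu
```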

This is my code:

template<int i> __device__ float convolutionRow(float x, float y){

	return tex2D(texData, x + KERNEL_RADIUS - i, y) * d_Kernel[i]
		+ convolutionRow<i - 1>(x, y);

}

// Terminating specialization: without this the template recursion never ends.
template<> __device__ float convolutionRow<-1>(float x, float y){

	return 0;

}

__global__ void convolutionRowGPU(

	uchar4 *d_Result,

	int dataW

)

{

	const int ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;

	const int iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;

	const float ex = (float)IMUL(ix, 16) + 0.5f;

	const float ey = (float)iy + 0.5f;

	uchar4 sum = make_uchar4(0, 0, 0, 0);

	sum.x = convolutionRow<KERNEL_DIAMETER>(ex,     ey);

	sum.y = convolutionRow<KERNEL_DIAMETER>(ex + 1, ey);

	sum.z = convolutionRow<KERNEL_DIAMETER>(ex + 2, ey);

	sum.w = convolutionRow<KERNEL_DIAMETER>(ex + 3, ey);

	d_Result[IMUL(iy, dataW) + ix] = sum;

}

But when I use a for loop, the register count per thread goes down, maybe because the temporaries are reused. The performance is lower, though, because the loop is not unrolled.

This is the for-loop code; here the register count per thread is just 12.

__global__ void convolutionRowGPU(

	uchar4 *d_Result,

	int dataW

)

{

	const int ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;

	const int iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;

	const float ex = (float)IMUL(ix, 16) + 0.5f;

	const float ey = (float)iy + 0.5f;

	uchar4 sum = make_uchar4(0, 0, 0, 0);

	for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
		sum.x += tex2D(RowTexData, ex + 0 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
	}

	for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
		sum.y += tex2D(RowTexData, ex + 1 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
	}

	for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
		sum.z += tex2D(RowTexData, ex + 2 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
	}

	for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
		sum.w += tex2D(RowTexData, ex + 3 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
	}

	d_Result[IMUL(iy, dataW) + ix] = sum;

}

Is there any way to get the performance of the unrolled version while using only as many registers per thread as the for-loop version?

Thanks very much.

Try adding

#pragma unroll 4

before each loop. Replace 4 with the actual number of iterations (2 * KERNEL_RADIUS + 1?).
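As a sketch, here is what that looks like on the first of your loops (this assumes KERNEL_RADIUS is a compile-time constant, which it must be for full unrolling to happen):

```cuda
// Hint to nvcc to unroll the loop; with a compile-time trip count
// (2 * KERNEL_RADIUS + 1 iterations) a bare "#pragma unroll" unrolls fully.
#pragma unroll
for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
	sum.x += tex2D(RowTexData, ex + 0 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
}
```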

I believe the compiler has problems optimizing the convolutionRow call (1st version) because the function itself is non-trivial. There's some kind of "template recursion" which is hard to figure out, especially since nvcc isn't necessarily used to recursion. By the way, you probably know that normal (run-time) recursion is not supported in device and global functions.

You might try simplifying convolutionRow (making it explicit, without the template recursion) and having your kernel call it as in the first version.
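Something along these lines (a hypothetical explicit version, assuming the same texData / d_Kernel setup as in your first post; the plain loop sums exactly the terms the template recursion produced, for i = KERNEL_DIAMETER down to 0):

```cuda
// Hypothetical non-recursive replacement for convolutionRow<KERNEL_DIAMETER>:
// the template recursion becomes a plain loop over the kernel taps.
__device__ float convolutionRowExplicit(float x, float y){
	float sum = 0.0f;
	for(int i = 0; i <= KERNEL_DIAMETER; i++)
		sum += tex2D(texData, x + KERNEL_RADIUS - i, y) * d_Kernel[i];
	return sum;
}
```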

Still, it's likely that your main bottleneck is memory access (the write to d_Result): it may not be coalesced.

PS. Don’t ask the same question in different forums, once is enough.