Hi, all
I want to implement the 8-bit image convolution, do you have better way to do it? just I use the uchar4 to implement it, but the register number per thread is not enough to the high GPU occupancy.
Do you have any advice to reduce the number of register per thread, you know the number of register is limited and that affect the GPU occupancy. So do you have any advice to reduce the number of register? Now the register per thread of my program is 30, and I don’t know why the number is too high either. Can you give me some help to reduce this? Thanks very much.
This is my code:
template<int i> __device__ float convolutionRow(float x, float y){
return
tex2D(texData, x + KERNEL_RADIUS - i, y) * d_Kernel[i]
+ convolutionRow<i - 1>(x, y);
}
__global__ void convolutionRowGPU(
uint4 *d_Result,
int dataW,
)
{
const int ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
const int iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
const float ex = (float)IMUL(ix, 16)+0.5;
const float ey = (float)iy + 0.5f;
uchar4 sum=make_uchar4(0,0,0,0);
sum.x = convolutionRow<KERNEL_DIAMETER>(ex, ey);
sum.y = convolutionRow<KERNEL_DIAMETER>(ex+1, ey);
sum.z = convolutionRow<KERNEL_DIAMETER>(ex+2, ey);
sum.w = convolutionRow<KERNEL_DIAMETER>(ex+3, ey);
d_Result[IMUL(iy, dataW) + ix] = sum;
}
But when I use for loop, the register number of per thread is reduced, maybe because of reusing of temp. But the performance is low because of unrolling.
This is for loop code, and the register per thread is just 12.
__global__ void convolutionRowGPU(
uint4 *d_Result,
int dataW,
)
{
const int ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
const int iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
const float ex = (float)IMUL(ix, 16)+0.5;
const float ey = (float)iy + 0.5f;
uchar4 sum=make_uchar4(0,0,0,0);
for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
sum.x += tex2D(RowTexData, ex +0 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
}
for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++)
{sum.y += tex2D(RowTexData, ex +1 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
}
for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
sum.z += tex2D(RowTexData, ex +2 + k, ey) * d_Kernel[KERNEL_RADIUS - k];
}
for(int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
sum.w += tex2D(RowTexData,ex +3+ k, ey) * d_Kernel[KERNEL_RADIUS - k];
}
d_Result[IMUL(iy, dataW) + ix] = sum;
}
Do you have any approach to achieve the unrolling performance and just use the number of register per thread in for loop way?
Thanks very much.