Hi,
I have been trying to optimise my kernels, and it’s proving very difficult.
I have coalesced memory accesses, and halved the number of texture cache misses. However, I always end up using slightly too many registers, which brings the occupancy way down, and so I only get a minor speedup.
For example, here is one of my kernels below. Previously it was using 16 registers, and I could get 100% occupancy. Then I changed the source texture datastructure slightly, and now the current version uses 17 registers and I can only get 1/3 occupancy, so I’ve only gotten a small speed up.
It seems like what ever optimisations I do, I end up using more registers which cancels out any benefits (or even makes the kernel slower). I can’t find a strategy for decreasing the register usage. I’ve tried fiddling with the order of calculations, avoiding named variables for intermediate results, using loops, unrolling loops. None of them seem to have a predictable effect, I have literally spent hours trying different things, and I still don’t have a good model of how the compiler assigns registers. Do you you guys have any suggestions? Or maybe I am expected too much, and 100% occupancy is too difficult?
/* The block size is 32x8 */
KERNEL void ahd_kernel_interp_g(pixel4* g_horz_res, pixel4* g_vert_res, int width, int height)
{
uint x = blockIdx.x*blockDim.x + threadIdx.x;
uint y = blockIdx.y*blockDim.y + threadIdx.y;
if (x < 2 || y < 2 || x >= width-2 || y >= height-2) {
return;
}
int filter_color = get_filter_color(x,y);
int mulB = filter_color == B;
int mulR = filter_color == R;
int mulG = filter_color == G;
char4 h_res, v_res;
/* Copy existing value to output */
h_res.x = v_res.x = mulB * tex2D(src,x,y);
h_res.y = v_res.y = mulG * tex2D(src,x,y);
h_res.z = v_res.x = mulR * tex2D(src,x,y);
/* Interpolate Green values first */
if (filter_color == R || filter_color == B) {
/* Filter color is red or blue Interpolate green channel horizontally */
/* Use existing green values */
float sum = (tex2D(src,x-1,y) +
tex2D(src,x+1,y))/2.0f;
/* And use existing red/blue values and apply filter 'h' */
sum += (-tex2D(src,x-2,y)/4.0f +
tex2D(src,x, y)/2.0f +
-tex2D(src,x+2,y)/4.0f)/4.0f;
h_res.y = (uchar)clampc(sum);
/* Interpolate green channel vertically */
/* Use existing green values */
sum = (tex2D(src,x,y-1) +
tex2D(src,x,y+1))/2.0f;
/* And use existing red/blue values and apply filter 'h' */
sum += (-tex2D(src,x,y-2)/4.0f +
tex2D(src,x,y )/2.0f +
-tex2D(src,x,y+2)/4.0f)/4.0f;
v_res.y = (uchar)clampc(sum);
}
int res_index = (y*width + x);
g_horz_res[res_index] = h_res;
g_vert_res[res_index] = v_res;
}