2nd output array gives garbage when using a kernel function to perform SAD; works for the 1st output array.


I am a newbie to OpenCL. I have been tasked with some image processing work.
Anyway, I am passing 2 sets of YUV data (left and right images of size 320 x 168) to a kernel function, which computes the gradient of each pixel using SAD (sum of absolute differences). For my first output, I only use 1 set of YUV data and it works nicely, producing the same values as computed on the CPU.
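For reference, the CPU version I compare against computes, per pixel, the horizontal SAD over the three planes. A minimal sketch of that computation (function and variable names are mine, not the actual code) looks like this:

```c
#include <stdlib.h>

/* Horizontal SAD gradient for one planar YUV image (width*height bytes
 * per plane). At the last column of a row, the difference is taken with
 * the previous pixel instead of the next one, mirroring the kernel. */
static void grad_hor_cpu(unsigned char *grad,
                         const unsigned char *y,
                         const unsigned char *u,
                         const unsigned char *v,
                         int width, int height)
{
    for (int i = 0; i < width * height; i++) {
        /* base index of the pixel pair to difference */
        int j = (i % width == width - 1) ? i - 1 : i;
        int sad = abs(y[j] - y[j + 1])
                + abs(u[j] - u[j + 1])
                + abs(v[j] - v[j + 1]);
        grad[i] = (unsigned char)(sad > 255 ? 255 : sad);
    }
}
```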

However, when I try to add a 2nd set of YUV data to the kernel function and compute the 2nd output array, nothing seems to work (on the 2nd output array). I even tried hard-coding every element to 5, but the output array still shows arbitrary values of its own.

Here is the kernel function implementation (output arrays are the first 2 parameters):
const char *grad_l_h_cl =
"__kernel void grad_l_h("
"    __global unsigned char* img_grad_left_hor,"
"    __global unsigned char* img_grad_right_hor,"
"    __global unsigned char* p1_y,"
"    __global unsigned char* p1_u,"
"    __global unsigned char* p1_v,"
"    __global unsigned char* p2_y,"
"    __global unsigned char* p2_u,"
"    __global unsigned char* p2_v,"
"    int width,"
"    int height)"
"{"
"    const uint index = get_global_id(0);"
"    unsigned char diff_y = p2_y[index] - p2_y[index+1], diff_u = 0, diff_v = 0;"
"    if (index % width == width - 1) {"
"        img_grad_left_hor[index]  = abs_diff(p1_y[index-1], p1_y[index]) + abs_diff(p1_u[index-1], p1_u[index]) + abs_diff(p1_v[index-1], p1_v[index]);"
"        img_grad_right_hor[index] = abs_diff(p2_y[index-1], p2_y[index]) + abs_diff(p2_u[index-1], p2_u[index]) + abs_diff(p2_v[index-1], p2_v[index]);"
"    } else {"
"        img_grad_left_hor[index]  = abs_diff(p1_y[index], p1_y[index+1]) + abs_diff(p1_u[index], p1_u[index+1]) + abs_diff(p1_v[index], p1_v[index+1]);"
"        img_grad_right_hor[index] = abs_diff(p2_y[index], p2_y[index+1]) + abs_diff(p2_u[index], p2_u[index+1]) + abs_diff(p2_v[index], p2_v[index+1]);"
"    }"
"    if (img_grad_left_hor[index] == 0) {"
"        img_grad_left_hor[index] = 0;"
"    }"
"    if (img_grad_left_hor[index] > 255) {"
"        img_grad_left_hor[index] = 255;"
"    }"
"}";

Here is how I perform the operation (g_worksize = 8 x 320 x 168, l_worksize = 256):
error=clEnqueueNDRangeKernel(cq, k_cfg, 1, NULL, &g_worksize, &l_worksize, 0, NULL, NULL);
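Since OpenCL 1.x requires the global size to be divisible by the local size when an explicit local size is passed, I round it up with a small helper (a sketch; the helper name is mine):

```c
#include <stddef.h>

/* Round a global work size up to the next multiple of the local size,
 * as clEnqueueNDRangeKernel on OpenCL 1.x requires when an explicit
 * local work size is given. */
static size_t round_up_global(size_t global, size_t local)
{
    return ((global + local - 1) / local) * local;
}
```

With the numbers above, 320 x 168 = 53760 is already a multiple of 256, so the rounding is a no-op for a single image.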

I have created input buffers as such (worksize = 320 x 168):
memp1_u=clCreateBuffer(context, CL_MEM_READ_ONLY, worksize, NULL, &error);
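In case it helps, here is roughly how I create the two output buffers and read the results back after the kernel runs (the `mem_grad_*` and `host_grad_*` names here are shortened stand-ins, not my exact variable names):

```c
/* Output buffers: one byte per pixel, written by the kernel. */
cl_mem mem_grad_left  = clCreateBuffer(context, CL_MEM_WRITE_ONLY, worksize, NULL, &error);
cl_mem mem_grad_right = clCreateBuffer(context, CL_MEM_WRITE_ONLY, worksize, NULL, &error);

/* After the kernel has run, blocking reads copy the results to host memory. */
error = clEnqueueReadBuffer(cq, mem_grad_left,  CL_TRUE, 0, worksize, host_grad_left,  0, NULL, NULL);
error = clEnqueueReadBuffer(cq, mem_grad_right, CL_TRUE, 0, worksize, host_grad_right, 0, NULL, NULL);
```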

I suspect some memory settings are required to hold the data for the 2nd output array, but I have no idea how to do this.

Please kindly help or advise.

My system is:
GT220 - 6 multiprocessors, 48 CUDA cores, Compute Capability 1.2
GPU Computing SDK 3.2
WinXP Pro