async_work_group_copy doesn't run

Hi,
I am trying to perform some operations on two double arrays, copying into local memory the part of the first array used by these operations.
I would like to use async_work_group_copy, but it doesn’t work correctly.
Here is the kernel code and how I pass the arguments:
CPU - Host
cl_int clStatus = clSetKernelArg (kernel, 0, sizeof(cl_mem), &m_pMemFirstArray);
clStatus |= clSetKernelArg (kernel, 1, sizeof(int), &m_sizeOfSecondArray);
clStatus |= clSetKernelArg (kernel, 2, sizeof(cl_mem), &m_pMemSecondArray);
clStatus |= clSetKernelArg (kernel, 3, sizeof(cl_double)*(m_NumberOfEls), NULL);
checkCLStatus(clStatus);
//Execute kernel
clStatus = clEnqueueNDRangeKernel (queue, kernel, 1, NULL, &sizeOfFirstArray, 0, 0, NULL, NULL);
checkCLStatus(clStatus);
waitCompletion();

GPU - device
__kernel void evaluate_product(__global const double *pFirstArray, const int n,
__global const double *pSecondArray, __local double *pLocal,…)
{
int gid = get_global_id(0); int size = get_global_size(0);
if (gid>=0 && gid) {
event_t event = (event_t)0;
//the elements I would copy in local are a contiguous part of n elements in pFirstArray
//starting from each gid
event = async_work_group_copy(pLocal,&pFirstArray[n*gid], (size_t)(n),event);
wait_group_events (1, &event);
double output = 0.0f;
for (int k=0; k<n; k++)
output += pLocal[k]*pSecondArray[k];

}
}

Where is the error?

And what is the outcome of your code? Does it compile?

It compiles and runs without errors, but the result isn’t correct; if I use a simple for loop with direct access to the global memory buffer, it runs correctly.
Maybe there are some errors in the way I use async_work_group_copy?
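
When comparing the two kernel variants, a host-side reference can make "correct" concrete. The following is only a sketch in plain C, assuming the kernel is meant to compute, for each work-item gid, the dot product of the n elements starting at pFirstArray[n*gid] with pSecondArray. The function name reference_products and the out array are hypothetical, since the posted kernel does not show where the result is stored.

```c
#include <stddef.h>

/* Hypothetical host-side reference (not part of the posted code):
 * out[g] = sum over k of first[n*g + k] * second[k].
 * 'first' must hold at least num_items * n elements. */
static void reference_products(const double *first, const double *second,
                               size_t n, size_t num_items, double *out)
{
    for (size_t g = 0; g < num_items; ++g) {
        double acc = 0.0;
        for (size_t k = 0; k < n; ++k)
            acc += first[n * g + k] * second[k];
        out[g] = acc;
    }
}
```

Reading the device results back and comparing them element by element (with a small tolerance) against this output shows which work-items go wrong.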

Two things (comments):

__kernel void evaluate_product(__global const double *pFirstArray, const int n, __global const double *pSecondArray, __local double *pLocal,....)
{
	int gid = get_global_id(0);
	int size = get_global_size(0); 
	
	if (gid >= 0 && gid) {	// mutually exclusive conditions?!
		event_t event = (event_t)0; 
		//the elements I would copy in local are a contiguous part of n elements in pFirstArray
		//starting from each gid
		
		event = async_work_group_copy(pLocal, pFirstArray+n*gid, (size_t)(n), event);	// strided access?
		wait_group_events (1, &event);
		double output = 0.0f;
		for (int k=0; k<n; k++) output += pLocal[k]*pSecondArray[k]; 
		... 
	} 
}

Hi,
Sorry, the first one was a mistake made while copying the code :) (the original is gid < size).
What do you mean by “strided access”? My original kernel did a simple dot product between the two global arrays pFirstArray and pSecondArray. So, to improve its performance, I thought I would copy into “local memory” the subparts of pFirstArray used for the dot product. But that isn’t correct, because I want each work item to have a different subpart of pFirstArray and not share the same subpart with the other work items of the same work group. Above all, I was trying this because the original (simple) kernel took 30 ms on my NVIDIA GTX 260 video card, while on an ATI Radeon HD 6900 it took less than 10 ms. Why this difference? Does it depend on using double variables?
Any ideas?
Tks
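
One point that bears on this: the OpenCL specification requires async_work_group_copy to be encountered by all work-items of a work-group with the same argument values, otherwise the result is undefined. A source pointer that depends on get_global_id(0) does not satisfy that. A possible layout, sketched here as a plain-C model rather than as OpenCL code, is one copy per work-group of all the rows that group needs, after which each work-item reads only its own row. The names model_work_group and tile are hypothetical; group_id and local_size stand in for get_group_id(0) and get_local_size(0), and the rows are assumed to start at n*gid without overlapping.

```c
#include <stddef.h>
#include <string.h>

/* Plain-C model of one work-group; not OpenCL code.
 * 'tile' stands in for the __local buffer and must hold
 * local_size * n doubles. */
static void model_work_group(const double *first, const double *second,
                             size_t n, size_t group_id, size_t local_size,
                             double *tile, double *out)
{
    /* One copy for the whole group: the source offset depends only on
     * group_id, so every work-item would pass identical arguments. */
    const double *src = first + group_id * local_size * n;
    memcpy(tile, src, local_size * n * sizeof(double));

    /* After the copy, each work-item (lid) reads only its own row. */
    for (size_t lid = 0; lid < local_size; ++lid) {
        const double *row = tile + lid * n;
        double acc = 0.0;
        for (size_t k = 0; k < n; ++k)
            acc += row[k] * second[k];
        out[group_id * local_size + lid] = acc;
    }
}
```

In a kernel, the tile would need local_size * n doubles of local memory, which may not fit for larger n. Since every row is read by only one work-item, staging it in local memory may also bring little benefit; pSecondArray, which all work-items share, looks like the more natural candidate.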

Hello,
What I meant was that you appear (judging by the code) to use a strided access pattern at the ‘async_work_group_copy’ call. There is a function, ‘async_work_group_strided_copy’, that may be more suitable here.
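
For reference, the global-to-local form of async_work_group_strided_copy gathers num_elements elements from the source, taking one element every src_stride elements, and packs them contiguously into the destination. The plain-C model below only illustrates that indexing; it is not OpenCL code, and the name model_strided_copy is hypothetical. Like async_work_group_copy, the real built-in is a single copy performed by the whole work-group.

```c
#include <stddef.h>

/* Plain-C model of the global-to-local strided copy:
 * dst[i] = src[i * src_stride] for i in [0, num_elements). */
static void model_strided_copy(double *dst, const double *src,
                               size_t num_elements, size_t src_stride)
{
    for (size_t i = 0; i < num_elements; ++i)
        dst[i] = src[i * src_stride];
}
```

With src_stride equal to 1 this reduces to a contiguous copy, so the strided variant mainly matters when the wanted elements are spaced apart in global memory.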

MK