memory sync problem writing to global memory within kernel iterations

I’m working on an iterative algorithm where each thread calculates a new value using values written by other threads and writes it to global memory at each iteration.

As there is no global sync available (as far as I know), I’m trying to do the iterations within one work-group, setting both local-work-size and global-work-size to 512.

So I only use one of the available compute units.
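For reference, the host-side launch looks roughly like this (a minimal sketch; the variable names and the missing error checking are mine, not the actual code):

	/* One work-group of 512 work-items: global and local size are both 512. */
	size_t global_size = 512;
	size_t local_size  = 512;

	clSetKernelArg(kernel, 0, sizeof(cl_mem), &c_buffer);
	clSetKernelArg(kernel, 1, sizeof(cl_int), &iter);
	clSetKernelArg(kernel, 2, sizeof(cl_int), &dim);

	clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
	                       &global_size, &local_size,
	                       0, NULL, NULL);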

__kernel void kernel_calc(__global float * c,	//vector
			int iter,		//iterations
			int dim)		//vector dimension
{
	int gx = get_global_id(0);

	float tmp = 0;

	for(int i = 0; i < iter; i++)
	{
		if(gx == 0)
			tmp = c[gx] + 0.1 * (c[gx+1] - 2 * c[gx]);
		else if(gx == dim-1)
			tmp = c[gx] + 0.1 * (c[gx-1] - 2 * c[gx]);
		else
			tmp = c[gx] + 0.1 * (c[gx-1] - 2 * c[gx] + c[gx+1]);

		barrier(CLK_LOCAL_MEM_FENCE);
		c[gx] = tmp;
		barrier(CLK_GLOBAL_MEM_FENCE);
	}
}

As initial values I fill the c vector with zeros, except the first and last entries, which are set to 50000.

Hence the resulting values should be symmetric (c[0] == c[dim-1], c[1] == c[dim-2] …).
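The host-side initialization is roughly this fragment (a sketch; host_c, c_buffer and queue are placeholder names of mine):

	/* Fill the vector with zeros except the two boundary entries. */
	float *host_c = (float *)calloc(dim, sizeof(float));
	host_c[0]       = 50000.0f;
	host_c[dim - 1] = 50000.0f;

	clEnqueueWriteBuffer(queue, c_buffer, CL_TRUE, 0,
	                     dim * sizeof(float), host_c, 0, NULL, NULL);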

Running the algorithm on CL_DEVICE_TYPE_CPU (ATI Stream 2.0.1) everything is fine, but running on the GPU (Nvidia 3.0.1 on a 9400GT) the results are wrong.

Can there be some synchronization error?

Is there any possibility to do a global synchronization within the kernel, without iterating over clEnqueueNDRangeKernel(…)?
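What I want to avoid is the usual host-side approach, where each iteration is a separate kernel launch and the in-order command queue provides the global sync between them. Roughly like this (a sketch with placeholder names):

	for(int i = 0; i < iter; i++)
	{
		/* each launch sees all global memory writes of the previous one */
		clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
		                       &global_size, &local_size, 0, NULL, NULL);
	}
	clFinish(queue);	/* wait for the last iteration to finish */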

Why do you use CLK_LOCAL_MEM_FENCE in your first barrier? It’s for local memory (i.e. an on-chip memory shared among the work-items in a work-group), which you don’t use in your kernel. I think you should use CLK_GLOBAL_MEM_FENCE for both barriers.

However, mind that a global memory sync is very slow on current NVIDIA GPUs. You should consider loading the data into local memory and performing local syncs, which are much faster.
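A rough sketch of that approach (untested; the kernel name is mine, and it assumes dim is at most the work-group size of 512, so the whole vector fits into one work-group's local memory):

	__kernel void kernel_calc_local(__global float * c, int iter, int dim)
	{
		__local float buf[512];		/* matches the local-work-size of 512 */
		int lx = get_local_id(0);

		/* copy the vector into local memory once */
		if(lx < dim)
			buf[lx] = c[lx];
		barrier(CLK_LOCAL_MEM_FENCE);

		for(int i = 0; i < iter; i++)
		{
			float tmp = 0;
			if(lx == 0)
				tmp = buf[lx] + 0.1f * (buf[lx+1] - 2 * buf[lx]);
			else if(lx == dim-1)
				tmp = buf[lx] + 0.1f * (buf[lx-1] - 2 * buf[lx]);
			else if(lx < dim)
				tmp = buf[lx] + 0.1f * (buf[lx-1] - 2 * buf[lx] + buf[lx+1]);

			barrier(CLK_LOCAL_MEM_FENCE);	/* all reads done before anyone writes */
			if(lx < dim)
				buf[lx] = tmp;
			barrier(CLK_LOCAL_MEM_FENCE);	/* all writes visible before the next iteration */
		}

		/* write the result back to global memory once */
		if(lx < dim)
			c[lx] = buf[lx];
	}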

Why use CLK_GLOBAL_MEM_FENCE as the first barrier?

I’m writing to tmp, which is a private variable, therefore I thought CLK_LOCAL_MEM_FENCE should be used.

tmp is a private variable, not a local variable. CLK_LOCAL_MEM_FENCE is for local variables (shared memory in CUDA terms). Since the data you are synchronizing on lives in global memory (the c buffer), the fence needs to be CLK_GLOBAL_MEM_FENCE.

If you use double buffering you can reduce the number of global memory barriers per iteration from two to one. For example:

for(int i = 0; i < iter; i += 2) {

  d[gx] = c[gx] + ...;		// read from c, write to d

  barrier(CLK_GLOBAL_MEM_FENCE);

  c[gx] = d[gx] + ...;		// read from d, write back to c

  barrier(CLK_GLOBAL_MEM_FENCE);

}
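A fuller sketch of that idea, using the stencil from your kernel (untested; d is a second __global buffer of the same size that the host has to allocate, and iter is assumed to be even). It still relies on everything running in a single work-group, since barrier only synchronizes within a work-group:

	__kernel void kernel_calc_db(__global float * c,	//buffer A
				__global float * d,	//buffer B, same size
				int iter,		//iterations (even)
				int dim)		//vector dimension
	{
		int gx = get_global_id(0);

		for(int i = 0; i < iter; i += 2)
		{
			/* step 1: read from c, write to d; no barrier needed in between */
			if(gx == 0)
				d[gx] = c[gx] + 0.1f * (c[gx+1] - 2 * c[gx]);
			else if(gx == dim-1)
				d[gx] = c[gx] + 0.1f * (c[gx-1] - 2 * c[gx]);
			else
				d[gx] = c[gx] + 0.1f * (c[gx-1] - 2 * c[gx] + c[gx+1]);

			barrier(CLK_GLOBAL_MEM_FENCE);

			/* step 2: read from d, write back to c */
			if(gx == 0)
				c[gx] = d[gx] + 0.1f * (d[gx+1] - 2 * d[gx]);
			else if(gx == dim-1)
				c[gx] = d[gx] + 0.1f * (d[gx-1] - 2 * d[gx]);
			else
				c[gx] = d[gx] + 0.1f * (d[gx-1] - 2 * d[gx] + d[gx+1]);

			barrier(CLK_GLOBAL_MEM_FENCE);
		}
	}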

Double buffering is a very good idea, I’ll try it.