I’m working on an iterative algorithm where, in each iteration, every thread calculates a new value from values written by other threads and writes it to global memory.
As there is no global sync available (as far as I know), I’m trying to do the iterations within one work-group, setting both local-work-size and global-work-size to 512.
So I only use one of the available compute units.
__kernel void kernel_calc(__global float * c, //vector
                          int iter,           //iterations
                          int dim)            //vector dimension
{
    int gx = get_global_id(0);
    float tmp = 0;
    for(int i = 0; i < iter; i++)
    {
        if(gx == 0)
            tmp = c[gx] + 0.1 * (c[gx+1] - 2 * c[gx]);
        else if(gx == dim-1)
            tmp = c[gx] + 0.1 * (c[gx-1] - 2 * c[gx]);
        else
            tmp = c[gx] + 0.1 * (c[gx-1] - 2 * c[gx] + c[gx+1]);
        barrier(CLK_LOCAL_MEM_FENCE);
        c[gx] = tmp;
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
As initial values, I fill the vector c with zeros, except for the first and last entries, which are set to 50000.
Hence the resulting values should be symmetric (c[0] == c[dim-1], c[1] == c[dim-2] …).
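For comparison, this is a minimal serial C sketch of the update I expect the kernel to perform, which could be used to check the device results. The name reference_calc and the scratch buffer are only for illustration and are not part of the kernel. It is an assumption-laden sketch: it uses 0.1f instead of 0.1 and adds the two neighbours before subtracting, so that mirrored entries are computed from identical operands.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical serial reference for the kernel's update rule.
   Every new value is computed from the previous iteration only
   and is copied back after the whole sweep has finished.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
static int reference_calc(float *c, int iter, int dim)
{
    assert(c != NULL && dim > 0);

    float *next = malloc((size_t)dim * sizeof *next);
    if (next == NULL)
        return -1;

    for (int i = 0; i < iter; i++) {
        for (int x = 0; x < dim; x++) {
            /* A missing neighbour at either end contributes nothing,
               matching the gx == 0 and gx == dim-1 branches. */
            float left  = (x > 0)       ? c[x - 1] : 0.0f;
            float right = (x < dim - 1) ? c[x + 1] : 0.0f;
            next[x] = c[x] + 0.1f * ((left + right) - 2.0f * c[x]);
        }
        /* Publish the new values only after all reads are done. */
        memcpy(c, next, (size_t)dim * sizeof *c);
    }

    free(next);
    return 0;
}
```

Calling reference_calc(c, iter, dim) on a host copy of the initial vector and comparing it element by element with the buffer read back from the device should show at which index and iteration the GPU result starts to differ.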
Running the algorithm on CL_DEVICE_TYPE_CPU (ATI Stream 2.0.1), everything is fine, but running on the
GPU (Nvidia 3.0.1 on 9400GT), the results are wrong.
Could there be a synchronization error?
Is there any possibility to do a global synchronization (within the kernel, without iterating over clEnqueueNDRangeKernel(…))?