Block edge updates


I am working on an algorithm where each pixel needs its four neighbors.

So for each 16x16 block of threads, I want to copy the global memory into an 18x18 shared memory block, so that pixels on block edges have correct neighbors:

        -> [z z z z z]
[x x x] -> [z x x x z]
[x x x] -> [z x x x z]
[x x x] -> [z x x x z]
        -> [z z z z z]
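To make the layout concrete, here is a small CPU-side sketch in plain C (not the kernel itself) of the index mapping in the diagram: tile cell (ti, tj) corresponds to global pixel (gi0 + ti - 1, gj0 + tj - 1). The names `load_tile`, `BLOCK`, `TILE`, `gi0` and `gj0` are hypothetical, and filling out-of-image halo cells with 0 is only an assumption made for this illustration.

```c
#include <assert.h>

#define BLOCK 3            /* block side (16 in the real kernel) */
#define TILE  (BLOCK + 2)  /* block plus a 1-pixel halo on each side */

/* Copy the block whose top-left pixel is (gi0, gj0) into a TILE x TILE
   tile, halo included. Halo cells that fall outside the image are set
   to 0.0f (an assumption; clamping would be another option). */
static void load_tile(const float *u, int gHeight, int gWidth,
                      int gi0, int gj0, float *tile)
{
    assert(gHeight > 0 && gWidth > 0);
    for (int ti = 0; ti < TILE; ti++) {
        for (int tj = 0; tj < TILE; tj++) {
            int gi = gi0 + ti - 1;   /* tile row 0 is global row gi0 - 1 */
            int gj = gj0 + tj - 1;   /* tile column 0 is global column gj0 - 1 */
            int inside = gi >= 0 && gi < gHeight && gj >= 0 && gj < gWidth;
            tile[ti * TILE + tj] = inside ? u[gi * gWidth + gj] : 0.0f;
        }
    }
}
```

With `BLOCK` set to 3 this reproduces the 3x3 to 5x5 picture above; with 16 it gives the 18x18 tile.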

This is my simple code:

__kernel void test(__global float* u, __global float* v, __local volatile float* uLocal)
{
    // Block index
    int bi = get_group_id(1);
    int bj = get_group_id(0);

    // Local coordinates (shifted by 1 to leave room for the halo)
    int li = get_local_id(1)+1;
    int lj = get_local_id(0)+1;

    // Global coordinates
    int gi = get_global_id(1);
    int gj = get_global_id(0);

    // Local height and width (block size plus a 1-pixel halo on each side)
    int lHeight = get_local_size(1)+2;
    int lWidth  = get_local_size(0)+2;

    // Global height and width
    int gHeight = get_global_size(1);
    int gWidth  = get_global_size(0);

    // Pixels on the image border do not have all four neighbors: skip them
    if( (gi-1<0) || (gi+1>gHeight-1) || (gj-1<0) || (gj+1>gWidth-1) )
        return;

    uLocal[li*lWidth+lj] = u[gi*gWidth+gj];

    // N (the number of iterations) is not defined in this snippet
    for(int i=0 ; i<N ; i++)
    {
        // put edges in local memory
        if(li-1 == 0)
            uLocal[(li-1)*lWidth+lj] = u[(gi-1)*gWidth+gj];
        else if(li+1 == lHeight-1)
            uLocal[(li+1)*lWidth+lj] = u[(gi+1)*gWidth+gj];

        if(lj-1 == 0)
            uLocal[li*lWidth+lj-1] = u[gi*gWidth+gj-1];
        else if(lj+1 == lWidth-1)
            uLocal[li*lWidth+lj+1] = u[gi*gWidth+gj+1];

        // do something with uLocal...
        uLocal[li*lWidth+lj] = 0.5f*uLocal[li*lWidth+lj+1] - 0.5f*uLocal[li*lWidth+lj-1];
    }

    v[gi*gWidth+gj] = uLocal[li*lWidth+lj];
}

My problem concerns the for loop of my first program:

How do I synchronize the threads when the edges must be updated inside the for loop? The threads on a block's edges have more work to do than the others. barrier() does not seem to be compatible with if/else statements: when the threads are not synchronized, the whole program appears to be skipped (the GPU does nothing and the execution time is almost 0).
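As far as I understand the OpenCL rule, barrier() must be reached by every work-item of a work-group, so it cannot sit inside a branch that only the edge threads take; only the extra copies can be conditional. The plain C sketch below emulates one work-group on the CPU to show the structure I am aiming for: each loop over (li, lj) stands for the code all work-items run between two barriers. The names `run_group`, `BLOCK`, `TILE`, `gi0` and `gj0` are hypothetical, the block is assumed to lie strictly inside the image, and only one iteration is shown.

```c
#include <assert.h>

#define BLOCK 3            /* block side (16 in the real kernel) */
#define TILE  (BLOCK + 2)  /* block plus a 1-pixel halo on each side */

/* One iteration for the work-group whose top-left pixel is (gi0, gj0). */
static void run_group(const float *u, float *v, int gHeight, int gWidth,
                      int gi0, int gj0)
{
    float tile[TILE * TILE] = {0};   /* stands for uLocal */
    float result[BLOCK * BLOCK];     /* one private value per work-item */

    /* Assumption: the block is interior, so every halo read is in bounds. */
    assert(gi0 >= 1 && gj0 >= 1);
    assert(gi0 + BLOCK < gHeight && gj0 + BLOCK < gWidth);

    /* Phase 0: every work-item loads its own pixel; edge work-items also
       load a halo cell. The branches guard only the extra copies. */
    for (int li = 1; li <= BLOCK; li++) {
        for (int lj = 1; lj <= BLOCK; lj++) {
            int gi = gi0 + li - 1, gj = gj0 + lj - 1;
            tile[li * TILE + lj] = u[gi * gWidth + gj];
            if (li == 1)     tile[(li - 1) * TILE + lj] = u[(gi - 1) * gWidth + gj];
            if (li == BLOCK) tile[(li + 1) * TILE + lj] = u[(gi + 1) * gWidth + gj];
            if (lj == 1)     tile[li * TILE + lj - 1] = u[gi * gWidth + gj - 1];
            if (lj == BLOCK) tile[li * TILE + lj + 1] = u[gi * gWidth + gj + 1];
        }
    }
    /* barrier(CLK_LOCAL_MEM_FENCE) would go here, outside any if/else */

    /* Phase 1: every work-item reads its neighbors into a private value;
       nothing is written to the tile yet, so no neighbor is overwritten early. */
    for (int li = 1; li <= BLOCK; li++)
        for (int lj = 1; lj <= BLOCK; lj++)
            result[(li - 1) * BLOCK + (lj - 1)] =
                0.5f * tile[li * TILE + lj + 1] - 0.5f * tile[li * TILE + lj - 1];
    /* barrier(CLK_LOCAL_MEM_FENCE) would go here as well */

    /* Phase 2: every work-item stores its own result. */
    for (int li = 1; li <= BLOCK; li++) {
        for (int lj = 1; lj <= BLOCK; lj++) {
            tile[li * TILE + lj] = result[(li - 1) * BLOCK + (lj - 1)];
            v[(gi0 + li - 1) * gWidth + (gj0 + lj - 1)] = tile[li * TILE + lj];
        }
    }
}
```

Note that barrier() only synchronizes the work-items of one work-group, so this structure says nothing about halo cells that belong to neighboring blocks.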

Maybe I should remove the for loop from the kernel and call the kernel inside a for loop on the host instead, but I guess that would require many clEnqueueReadBuffer/clEnqueueWriteBuffer calls and slow down the algorithm.

Thanks for your help!