barriers and early return

Note: I’m doing this in opencl but I imagine same applies with CUDA.

I’m doing imaging processing, and in case the bordering thread blocks on the right/bottom borders go out of bounds of the image, I start each kernel with something like this:

int gx = get_global_id(0);
int gy = get_global_id(1);
if(gx >= imageSize.x || gy >= imageSize.y)

For a kernel I am working on, it took me a long time to figure out this early return was the culprit of basically the right/bottom bordering thread blocks not getting executed. Inside my kernel I had some code like this:

float minDist = FLT_MAX;
int blockIndex = 0;
while(blockIndex < verticalGroupCount)
    // [...] Read block amount of column data into shared mem

    // [...] Find min over shared mem
    barrier(CLK_LOCAL_MEM_FENCE); // SIDE NOTE: Is this barrier needed.  I only read from shared mem in this
                                  // second part of the loop.  I don't write to it until the beginning of the 
                                  // loop.


If I remove the early return and instead put boundary guards in the kernel (e.g., when reading from global memory to put in shared memory) and guard the write at the end:

if(gx < imageSize.x && gy < imageSize.y)
    write_imagef(verticalDT, (int2)(gx, gy), (float4)(minDist, 0.0f, 0.0f, 0.0f));

it works.

I just want to be clear why it works. Does the early return basically make it impossible for all threads to reach the barrier point? And so I am getting undefined behavior? Which in my case looks like all the threads in the group return early.