barriers and early return

Note: I’m doing this in opencl but I imagine same applies with CUDA.

I’m doing imaging processing, and in case the bordering thread blocks on the right/bottom borders go out of bounds of the image, I start each kernel with something like this:

int gx = get_global_id(0);
int gy = get_global_id(1);
if(gx >= imageSize.x || gy >= imageSize.y)
        return;

For a kernel I am working on, it took me a long time to figure out this early return was the culprit of basically the right/bottom bordering thread blocks not getting executed. Inside my kernel I had some code like this:

float minDist = FLT_MAX;
int blockIndex = 0;
while(blockIndex < verticalGroupCount)
{
    // [...] Read block amount of column data into shared mem
    barrier(CLK_LOCAL_MEM_FENCE);

    // [...] Find min over shared mem
    barrier(CLK_LOCAL_MEM_FENCE); // SIDE NOTE: Is this barrier needed.  I only read from shared mem in this
                                  // second part of the loop.  I don't write to it until the beginning of the 
                                  // loop.

    blockIndex++;
}

If I remove the early return and instead put boundary guards in the kernel (e.g., when reading from global memory to put in shared memory) and guard the write at the end:

if(gx < imageSize.x && gy < imageSize.y)
    write_imagef(verticalDT, (int2)(gx, gy), (float4)(minDist, 0.0f, 0.0f, 0.0f));

it works.

I just want to be clear why it works. Does the early return basically make it impossible for all threads to reach the barrier point? And so I am getting undefined behavior? Which in my case looks like all the threads in the group return early.

Precisely.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions

https://stackoverflow.com/questions/16928885/opencl-median-image-filter-using-local-work-groups/16932326#16932326
https://stackoverflow.com/questions/28563118/shall-i-return-if-the-global-id-is-above-the-number-of-elements-in-opencl/28563236#28563236