barrier and return-ing work-items

The OpenCL 1.1 (1.0 likewise) standard states in Table 6.16 regarding the barrier command:

I know that NVIDIA’s OpenCL implementation will execute code like the following, but I am not sure whether the standard covers it. In particular, I am not sure how to interpret “[…] executing the kernel.” (I think the sentence would be clearer without that addition.) One could argue that work-items that have returned are no longer executing the kernel and therefore do not have to reach the barrier; a more restrictive reading would make such code illegal, although an implementation may support it beyond what the OpenCL standard requires. I am especially curious whether others find the standard vague here, and what NVIDIA’s official interpretation is.

__kernel void k( unsigned height , ... ) {
    if ( get_global_id( 1 ) >= height ) {
        return;
    }
    f();
    barrier( CLK_LOCAL_MEM_FENCE );
    g();
}

The spec is not vague about this. A work-item returning early does not lift the requirement that every work-item in the work-group reach the barrier. GPU hardware does practically no checking of this sort. Memory access violations, out-of-bounds indexing, memory collisions: all of this goes unchecked by the hardware (that is part of why GPUs are fast), and if you made a coding mistake, these errors will occur without any notice given to you. If some work-items in a work-group return while others continue, the hardware will not account for the ones that have returned; if there were 128 work-items in the work-group, it will wait for 128 hits of the barrier.

To say the least, even if your code works, it works by accident. Per the spec, if not all work-items in a work-group hit the barrier, the behaviour is undefined. Even if it works on an NVIDIA GPU, the code is most likely not portable and may lock up the GPU on AMD or malfunction on an Intel CPU. Going against the spec is inadvisable.
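For what it is worth, the usual way to stay within the spec is to drop the early return and guard the work instead, so that the barrier sits outside any conditional and every work-item reaches it. The following is only a minimal sketch of that idea applied to the kernel in the question, not a drop-in replacement: the name k_guarded is made up, the elided kernel parameters are left out, and f() and g() are empty placeholders. The block under #ifndef __OPENCL_VERSION__ is purely an assumption of mine so that the sketch also compiles as plain C on the host; there, barrier() merely counts calls and does not synchronize anything.

```c
#ifndef __OPENCL_VERSION__
/* Host-only stubs (assumption, not part of OpenCL): they let the sketch
   compile as plain C and count calls instead of synchronizing. */
#include <stddef.h>
#define __kernel
#define CLK_LOCAL_MEM_FENCE 1
static size_t fake_gid1;                    /* pretend get_global_id(1) */
static int barrier_hits, f_calls, g_calls;  /* call counters */
static size_t get_global_id(unsigned dim) { (void)dim; return fake_gid1; }
static void barrier(int flags) { (void)flags; ++barrier_hits; }
static void f(void) { ++f_calls; }
static void g(void) { ++g_calls; }
#else
/* Device build: empty placeholders standing in for the real f() and g(). */
void f(void) { }
void g(void) { }
#endif

__kernel void k_guarded(unsigned height)
{
    /* Instead of returning early, remember whether this work-item has
       any work to do. */
    const int active = get_global_id(1) < height;

    if (active) {
        f();
    }

    /* Outside any conditional: reached by every work-item of the
       work-group, active or not. */
    barrier(CLK_LOCAL_MEM_FENCE);

    if (active) {
        g();
    }
}
```

Work-items with get_global_id( 1 ) >= height now idle through both guarded sections but still arrive at the barrier, so the result no longer depends on how a particular implementation treats work-items that have already returned.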