As the specification says, the global work size must be a multiple of the work group size. If it is not, the kernel won’t launch and ‘clEnqueueNDRangeKernel’ returns an error code instead (CL_INVALID_WORK_GROUP_SIZE in OpenCL 1.x). Try checking the return value of your ‘clEnqueueNDRangeKernel’ call.
As an example, say that you have a buffer of ‘n’ elements and you want to launch one work item for each. Next, say that you have some size ‘s’ that you want for each work group. In the easy case, where ‘n’ is a multiple of ‘s’, we get something like this, where ‘e’ denotes a data element.
eeeeeeeeeeeeeeeeeee ..... eeeeeeeeee
| group | group | ..... | group |
However, there might be a few extra elements in the buffer, breaking the even divisibility.
eeeeeeeeeeeeeeeeeee ..... eeeeeeeeeeeeee
| group | group | ..... | group |
In this case, it is not legal to simply increase the global size to include the extra elements, since the global size must be divisible by the work group size ‘s’. The only thing we can do, short of finding a new work group size that divides ‘n’ evenly, is to add another full work group.
eeeeeeeeeeeeeeeeeee ..... eeeeeeeeeeeeee
| group | group | ..... | group | group |
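The rounding step illustrated above can be sketched on the host side as follows. This is a minimal sketch in plain C, not code from the original question: the helper name ‘round_up_global_size’ is made up for illustration and is not part of the OpenCL API. Its result would be passed as the global work size to ‘clEnqueueNDRangeKernel’, with the work group size as the local work size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL API call): returns the smallest
   multiple of group_size that is greater than or equal to n. */
size_t round_up_global_size(size_t n, size_t group_size)
{
    assert(group_size > 0);                /* a work group cannot be empty */

    size_t remainder = n % group_size;
    if (remainder == 0)
        return n;                          /* already evenly divisible */

    return n + (group_size - remainder);   /* pad up to one more full group */
}
```

For instance, with 1000 elements and a work group size of 64 this gives 1024, i.e., 16 full work groups, where the last 24 work items have no element to work on.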
Note that some of the work items in the newly added work group won’t have anything to do. We therefore pass the actual number of elements to the kernel and make sure that these work items do nothing. This is what Maxim Milakov showed with his code snippet, shown again below for reference.
__kernel void palt(__global double *fprop, __global const double *fcoll, __global const int *nn, const uint max_size)
{
    size_t l = get_global_id(0);
    if( l >= max_size )
        return;
    fprop[l] = fcoll[nn[l]];
}
The parameter ‘max_size’ has the value of ‘n’, i.e., the number of elements in the buffer we are working on. Any work item with an index outside of that bound returns immediately.
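To see the effect of the bounds check without a device, the padded launch can be emulated serially on the host. The following is only a sketch under stated assumptions: it is plain C rather than OpenCL, the loop index stands in for get_global_id(0), and the names ‘palt_item’ and ‘emulate_launch’ are hypothetical.

```c
#include <stddef.h>

/* Plain C stand-in for the kernel body; gid plays the role of
   get_global_id(0). This is not OpenCL code. */
void palt_item(size_t gid, double *fprop, const double *fcoll,
               const int *nn, size_t max_size)
{
    if (gid >= max_size)
        return;                      /* padding work item: nothing to do */
    fprop[gid] = fcoll[nn[gid]];
}

/* Hypothetical serial emulation of a launch whose global size has been
   padded to a multiple of group_size (assumed to be greater than zero).
   Returns the number of emulated work items. */
size_t emulate_launch(size_t n, size_t group_size, double *fprop,
                      const double *fcoll, const int *nn)
{
    size_t groups = (n + group_size - 1) / group_size;  /* round up */
    size_t global_size = groups * group_size;

    for (size_t gid = 0; gid < global_size; ++gid)
        palt_item(gid, fprop, fcoll, nn, n);

    return global_size;
}
```

With 5 elements and a work group size of 4, the emulation runs 8 work items; the last 3 return before touching any memory, so nothing is written past the end of the 5-element buffer.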