local_work_size even divisible?

I have a vector of size n (unknown before the main program is launched), and I want to experiment with the work-group sizes. According to Khronos’ documentation of clEnqueueNDRangeKernel:

It must be evenly divisible? What stops me from doing, say:

__kernel void palt(__global double *fprop, __global const double *fcoll, __global const int *nn)

{

    size_t l = get_global_id(0);

if( l > get_global_size(0) ) return;

	fprop[l] = fcoll[nn[l]];

}

And how do I actually pick a suitable size?

Actually you will code this way:

__kernel void palt(__global double *fprop, __global const double *fcoll, __global const int *nn, const uint max_size)

{

    size_t l = get_global_id(0);

if( l >= max_size ) return;

	fprop[l] = fcoll[nn[l]];

}

With this code you are free to set global worksize to any value.

The only problem with this code is that Intel OpenCL compiler will not be able to auto-vectorize such kernel but that might not be an issue for you.

With this approach the output vector fprop still contains only 0 values. However, if I specify a worksize that divides n exactly, then it works!

More code for reference:

int block_sz_p = 128;

    const int max_size = ns*imax;

// set the parameters for the propagation operator

    errNum = clSetKernelArg(propagation_kernel, 0, sizeof(cl_mem), &fpd);

    errNum |= clSetKernelArg(propagation_kernel, 1, sizeof(cl_mem), &fcd);

    errNum |= clSetKernelArg(propagation_kernel, 2, sizeof(cl_mem), &nnd);

    errNum |= clSetKernelArg(propagation_kernel, 3, sizeof(int), (void *) &max_sz);

    checkErr(errNum, "clSetKernelArg(propagation)");

// specify the work group size/dim

    const size_t work_dim = 3;  

    const size_t global_work_size_propagation[] = {imax*ns, 1, 1};

    const size_t local_work_size_propagation[] = {block_sz_p, 1, 1};

// propagation

    clEnqueueNDRangeKernel(queue, propagation_kernel, work_dim, NULL, 

                           global_work_size_propagation, local_work_size_propagation,

                           0, NULL, &event);

clWaitForEvents(1, &event); 

    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, 

                            sizeof(cl_ulong), &start, NULL);

    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, 

                            sizeof(cl_ulong), &end, NULL);

    tker2 = (end-start);

What’s going on here?

As the specification sais, you must specify a global work size that is a multiple of the work group size. If not, then the kernel won’t launch. Try adding some error checking to your ‘clEnqueueNDRangeKernel’ call.

An example. Say that you have a buffer of ‘n’ elements and you want to launch a work item for each. Next, say that you have some size that you want for each work group. The easy case gives something like this, where ‘e’ denotes a data element.

eeeeeeeeeeeeeeeeeee ..... eeeeeeeeee

 | group  | group  | ..... | group  |

However, there might be a few extra elements in the buffer, breaking the even divisibility.

eeeeeeeeeeeeeeeeeee ..... eeeeeeeeeeeeee

 | group  | group  | ..... | group  |

In this case, it is not legal to simply increment the global size to include the extra elements, since ‘n’ must be divisible by ‘s’. The only thing we can do, save for finding a new work group size that makes it divisible again, is to add another full work group.

eeeeeeeeeeeeeeeeeee ..... eeeeeeeeeeeeee

 | group  | group  | ..... | group  | group  |

Note that some of the work items in the newly added work group won’t have anything to do. We therefor pass in the actual number of elements to the kernel and make sure that these work items does nothing. This is what Maxim Milakov showed with his code snippet, shown again below for reference.

__kernel void palt(__global double *fprop, __global const double *fcoll, __global const int *nn, const uint max_size)

{

    size_t l = get_global_id(0);

    if( l >= max_size )

        return;

    fprop[l] = fcoll[nn[l]];

}

The parameter ‘max_size’ has the value of ‘n’, i.e., the number of elements in the buffer we are working on. Any work items with an index outside of the bound returns immediately.

Thank you for the detailed explanation Martin.

I am, however, still a bit confused, are you not in fact incrementing the global size to make division exact, effectively adding another group, in this example?

If so I did try this approach and its NOT always working, Reordering a vector.

That is correct, we control the number of work group by specifying a global work size. If its not working for some reason then the problem has to be somewhere else. Let’s continue that discussion in the thread you linked to.