Concurrent Kernel Execution

I apologize if this is an old question. I wasn’t able to find the answer in the first few pages of this forum, with the search function, or in the very limited and horribly written OpenCL programming guide.

How do you implement concurrent kernel execution in OpenCL?

I don’t think this is possible in OpenCL at the moment.

The programming guide says it is possible:

For devices of compute capability 2.0, multiple kernels can execute concurrently on a device, so maximum utilization can also be achieved by using queues to enable enough kernels to execute concurrently.

It mentions using queues but has no example code or further explanation of how to achieve this. Do you just put the kernels into different command queues and then enqueue them without any event or flush between the two, for example?

The Nvidia OpenCL programming guide is very vague on this matter, as it is on quite a few things. I’m not sure if that’s intentional or not.

I think the key is the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of the command queue. See section 5.8 Out-of-order Execution of Kernels and Memory Object Commands of the OpenCL 1.0 specs (in the OpenCL 1.1 specs it is 5.11). You should also be able to create multiple (possibly in-order) command queues for the same device, which would at least allow the driver to schedule them concurrently. CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is not supported on compute capability 1.x cards, which can process only a single kernel call at a time (some of them with overlapping DMA transfers).

Regards,
Markus

The OpenCL API gives you two options, although I’ve not experimented with the NV drivers yet so I don’t know if they’ll both actually give parallel execution:

  1. Use an out-of-order queue (pass CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when creating the queue). This will cause all commands in that queue to be potentially parallelised, so you’ll need to use events to enforce any required ordering (e.g. loading of data before processing that data, or dependent kernels).

  2. Use multiple queues and put a command into each queue. Each queue can be scheduled independently, except where events are used to constrain ordering. It should be okay to clFlush the queues whenever, since that doesn’t block on completion (unlike clFinish).

Ok, thanks everyone, I’ll give this a try and let you know how it turns out.

When I try the second option, the profiler reports a really long run time for the first of the two kernels unless I call clFinish after each kernel. That would presumably prevent concurrent kernel execution (CKE), which defeats the purpose, so I’m not sure this works either.
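One way to check for overlap directly, rather than inferring it from a reported run time, is to compare the start and end timestamps of the two kernel events. The sketch below assumes the four values were read with clGetEventProfilingInfo using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on queues created with CL_QUEUE_PROFILING_ENABLE; the function name overlap_ns is made up for illustration, and plain uint64_t stands in for cl_ulong so the snippet compiles without the OpenCL headers.

```c
#include <stdint.h>

/* Hypothetical helper: returns how many nanoseconds the intervals
 * [start_a, end_a) and [start_b, end_b) have in common, or 0 if the
 * two kernels ran strictly one after the other. */
uint64_t overlap_ns(uint64_t start_a, uint64_t end_a,
                    uint64_t start_b, uint64_t end_b)
{
    uint64_t lo = start_a > start_b ? start_a : start_b;  /* later start */
    uint64_t hi = end_a < end_b ? end_a : end_b;          /* earlier end */
    return hi > lo ? hi - lo : 0;
}
```

A result of 0 for two kernels that were enqueued back to back would suggest the driver serialized them; a nonzero result means their execution windows intersected according to the device timestamps.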