It appears that NVidia’s OpenCL implementation is still[1] blocking on calls to clEnqueueNDRangeKernel or possibly on clEnqueueWriteBuffer / clEnqueueReadBuffer.
You can observe this behavior by writing a high-iteration loop that queues a buffer write, followed by a kernel call, followed by a buffer read. Make sure to use the OpenCL wait list arguments and events so that each command waits for the previous one. After the loop, call clWaitForEvents on the last read event to wait for it to complete.
    for i = 0; i < LARGE_NUMBER; i++ {
        queue a write, waiting on the last read unless i = 0
        queue a kernel, waiting on the above write
        queue a read, waiting on the above kernel
    }
    wait on the last read
When I do this, the loop that does the queuing takes a very long time, and the final wait completes instantly. This supports the claim that NVidia’s OpenCL implementation is blocking on reads, writes, or kernel launches.
For further evidence, install AMD APP[2] and run the same code on a CPU (you can install AMD APP without an AMD card). You’ll see the queue loop finish almost instantly, and the program will then spend its time in the final wait.
I suppose the other alternative is that NVidia’s queueing algorithm/implementation is so slow that the GPU is actually outpacing it, even for fairly substantial kernels. That seems like it would be a bug in and of itself. However, I doubt this is the case, because if I increase the workload of each kernel, the time to run the queuing loop also increases.
It really ought not to matter what hardware I’m running, but just in case it does, I’m testing on a Quadro K5000 and an Intel Xeon X5482.
Note that I’m not attempting to run multiple kernels concurrently, merely to queue them asynchronously. Being able to queue kernels asynchronously would allow for a (concurrency-enabled) device to run concurrent kernels with only a single host thread. Which would be, you know, a billion times better than having to fiddle around with synchronizing an apparently-non-conforming OpenCL implementation.
Has anyone else observed this behavior? Is it a problem with my code, or with NVidia’s OpenCL implementation? Have I provided sufficient information?
[1] https://devtalk.nvidia.com/default/topic/415023/launch-kernels-in-parallel-/
[2] http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/
P.S. I sure am glad I write my forum posts in Emacs and not the browser! I forgot a subject line, and after submitting, my post was no longer in the box!
Edit: I forgot to ask a question.