How to reduce CPU usage while waiting for kernels to finish in OpenCL on NVIDIA GPUs?

I’m working on a cryptocurrency mining implementation in OpenCL and having trouble getting it to play nicely with the NVIDIA OpenCL driver. The problem is that the NVIDIA driver busy-waits (spins) on the CPU while waiting for kernels to finish. When you have 6-20 GPUs hanging off a single dual-core CPU, this causes the entire system to grind to a halt.

I’ve tried several solutions, but they all seem to break on Windows while working nicely on Linux:

a) Time the kernels and sleep the thread for ~95% of the time they take to execute, so that the driver only spins for the last ~5% of the kernel time. This fails because putting the thread to sleep immediately after enqueuing the kernel means the kernel does not start executing until after the sleep call returns. I then tried adding a clFlush() call to ensure the kernel would start, but the clFlush() call seems to block until the kernel completes, which defeats the purpose.

b) Use a non-blocking clEnqueueReadBuffer() call, followed by polling the event status and sleeping in a loop. This fails because the clEnqueueReadBuffer() call seems to completely ignore the blocking_read flag, always waiting for the kernel to finish before returning.

Ideas?

clEnqueueNDRangeKernel allows you to specify an event that will indicate completion of the kernel being enqueued. Have you tried polling that event?
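A minimal sketch of what such a poll-and-sleep loop could look like. To keep it compilable without the OpenCL headers, the status check is hidden behind a hypothetical callback (`done_fn`); in real code that callback would presumably wrap `clGetEventInfo` with `CL_EVENT_COMMAND_EXECUTION_STATUS` and compare the result against `CL_COMPLETE`. The names `poll_with_sleep` and `countdown_done` are made up for illustration, and `nanosleep` is POSIX, so on Windows something like `Sleep()` would be needed instead. Note also that the command generally has to be submitted with `clFlush` before polling can observe any progress.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns nonzero once the kernel has finished.
 * With OpenCL this would wrap a clGetEventInfo() query on the kernel's
 * event and check for CL_COMPLETE. */
typedef int (*done_fn)(void *ctx);

/* Poll is_done, sleeping sleep_ns nanoseconds between checks so the thread
 * yields the CPU instead of spinning. Returns the number of polls made, or
 * -1 if max_polls checks pass without completion. */
int poll_with_sleep(done_fn is_done, void *ctx, long sleep_ns, int max_polls)
{
    assert(is_done != NULL && sleep_ns >= 0 && sleep_ns < 1000000000L);
    struct timespec pause = { .tv_sec = 0, .tv_nsec = sleep_ns };
    for (int polls = 1; polls <= max_polls; ++polls) {
        if (is_done(ctx))
            return polls;
        nanosleep(&pause, NULL);
    }
    return -1;
}

/* Stand-in for a real event query (illustration only): reports completion
 * once the counter pointed to by ctx has reached zero. */
int countdown_done(void *ctx)
{
    int *remaining = (int *)ctx;
    if (*remaining == 0)
        return 1;
    --*remaining;
    return 0;
}
```

The sleep interval is a trade-off: a longer interval costs less CPU but adds latency between kernel completion and the next enqueue.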

I had not tried that. Unfortunately, it doesn’t seem to help. The status stays at “Queued” until clEnqueueReadBuffer() is called, which then waits for the kernel to complete (with or without the blocking flag set). After the clEnqueueReadBuffer() call, the kernel has status “Complete” (which is true, but not particularly useful).

OK, I have solved the problem. The OpenCL bindings I was using had a bug: both flushing and finishing a queue called clFinish() under the hood! I fixed that, then implemented method a), and it’s working well!
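For anyone wanting to try method a), here is a rough sketch of the timing arithmetic only. The OpenCL calls appear as comments so the block compiles without the OpenCL headers. The smoothing of the measured kernel time with an exponential moving average is an assumption added for illustration, not something described above, and the names `update_estimate_ns` and `sleep_budget_ns` are hypothetical.

```c
#include <assert.h>

/* Assumed helper: smooth the measured kernel time with an exponential
 * moving average so that one unusually slow or fast run does not swing
 * the sleep time too much. alpha = 1 means "use the latest measurement". */
double update_estimate_ns(double estimate_ns, double measured_ns, double alpha)
{
    assert(alpha > 0.0 && alpha <= 1.0);
    return estimate_ns + alpha * (measured_ns - estimate_ns);
}

/* How long to sleep after submitting the kernel: a fraction (e.g. 0.95)
 * of the expected run time, leaving the rest to the driver's own wait. */
long sleep_budget_ns(double estimate_ns, double fraction)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    if (estimate_ns <= 0.0)
        return 0; /* no estimate yet: do not sleep */
    return (long)(estimate_ns * fraction);
}

/* Intended order per iteration (OpenCL calls shown as comments only):
 *   1. clEnqueueNDRangeKernel(...)  enqueue the work
 *   2. clFlush(queue)               submit it; must NOT map to clFinish
 *   3. sleep for sleep_budget_ns(estimate, 0.95)
 *   4. blocking clEnqueueReadBuffer(...) or clFinish(queue)
 *   5. measure elapsed time and call update_estimate_ns(...)
 */
```

With an estimate of 100 ms and a fraction of 0.95, the thread sleeps for roughly 95 ms and the driver only spins for the remainder.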