NVIDIA's OpenCL runtime implementation treats async buffer reads as synchronous ones

I can't find a separate forum for OpenCL, so I'm creating this topic here, in the CUDA area.

I'm trying to avoid 100% CPU load when using NVIDIA GPUs through OpenCL. So I enqueue as many kernels as possible, then enqueue an asynchronous buffer read, and then check the read-buffer event status in a loop, sleeping 1 ms or so after each check.
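Simplified, the pattern looks like this (kernel count and work size are just placeholders; setup and error checks are omitted):

#include <CL/cl.h>
#include <unistd.h>   /* usleep() for the 1 ms sleep */

/* Simplified polling pattern: cq, kernel, buf, host_ptr and size are assumed
 * to be created elsewhere; error checks are omitted for brevity. */
static void run_and_poll(cl_command_queue cq, cl_kernel kernel,
                         cl_mem buf, void *host_ptr, size_t size)
{
    size_t gws = 1024;                 /* placeholder global work size */
    cl_event read_evt;
    cl_int status;

    for (int i = 0; i < 64; ++i)       /* long chain of kernels */
        clEnqueueNDRangeKernel(cq, kernel, 1, NULL, &gws, NULL,
                               0, NULL, NULL);

    /* CL_FALSE = non-blocking read; the call itself should return at once. */
    clEnqueueReadBuffer(cq, buf, CL_FALSE, 0, size, host_ptr,
                        0, NULL, &read_evt);
    clFlush(cq);

    /* Poll the read event, sleeping ~1 ms between checks.
       (Negative error statuses are ignored here for brevity.) */
    do {
        clGetEventInfo(read_evt, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
        usleep(1000);
    } while (status != CL_COMPLETE);

    clReleaseEvent(read_evt);
}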

But this check always finishes after the first iteration. So either the rather long kernel chain plus memory transfers complete in under 1 ms (highly unlikely even for the fastest NVIDIA GPUs)… or the async buffer read actually returns only when the whole queue has completed, i.e. it's not an async but a blocking buffer read.

Well, I checked this possibility and found that if I check the event belonging not to the buffer read but to the last kernel in the sequence, the loop iterates many times (as it should; the GPU can't do all that work instantly).
So,
clEnqueueReadBuffer(cq, …, CL_FALSE, …); is not an asynchronous but a blocking call (!).
How so? How could NVIDIA implement the OpenCL runtime SO badly?
Maybe I'm missing something?
Any comments, please?

You can avoid 100% CPU usage by using clWaitForEvents to wait for the read to finish.
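Something like this (a minimal sketch; queue and buffer setup assumed elsewhere):

#include <CL/cl.h>

/* Sketch: wait on the read's event instead of polling it in a loop.
 * cq, buf, host_ptr and size are assumed to already exist. */
static void read_and_wait(cl_command_queue cq, cl_mem buf,
                          void *host_ptr, size_t size)
{
    cl_event read_evt;

    clEnqueueReadBuffer(cq, buf, CL_FALSE, 0, size, host_ptr,
                        0, NULL, &read_evt);
    clFlush(cq);

    clWaitForEvents(1, &read_evt);   /* returns once the transfer completes */
    clReleaseEvent(read_evt);
}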

Regarding whether clEnqueueReadBuffer blocks or not, a minimal working example showing this behaviour would help.

clWaitForEvents() doesn’t help because it also busy waits when using NVIDIA’s OpenCL implementation/drivers.

In fact, the only way to work around this bug is to sleep for some precalculated time after the kernel is enqueued and flushed. This reduces CPU load to under 10% with almost no impact on OpenCL performance.
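A rough sketch of that workaround (the duration estimate expected_us is a placeholder; you have to measure or calculate it for your own kernels):

#include <CL/cl.h>
#include <unistd.h>   /* usleep() */

/* Sleep for most of the expected runtime before waiting, so the driver's
 * busy-wait only spins for the remaining tail of the work. */
static void enqueue_and_wait(cl_command_queue cq, cl_kernel kernel,
                             size_t gws, unsigned expected_us)
{
    cl_event evt;

    clEnqueueNDRangeKernel(cq, kernel, 1, NULL, &gws, NULL, 0, NULL, &evt);
    clFlush(cq);                 /* make sure the work is submitted to the GPU */

    usleep(expected_us);         /* sleep through most of the expected runtime */
    clWaitForEvents(1, &evt);    /* busy-wait only covers what is left */

    clReleaseEvent(evt);
}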