I couldn’t find a separate forum for OpenCL, so I’m creating this topic here, in the CUDA area.
I’m trying to avoid 100% CPU load when using NV GPUs through OpenCL. So I enqueue as many kernels as possible, then enqueue an async buffer read, and then check the status of the read-buffer event in a loop, sleeping 1 ms or so after each check.
But this check always finishes after the first iteration. So either a quite long chain of kernels plus memory transfers completes in less than 1 ms (highly unlikely even for the fastest NV GPUs)… or the async buffer read actually returns only when the whole queue has completed, i.e. it’s not an async but a blocking buffer read.
Well, I checked this possibility and found that if I check the event belonging not to the buffer read but to the last kernel in the sequence, the loop iterates many times (as it should; the GPU can’t do all the work instantly).
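For reference, the polling loop I describe looks roughly like the sketch below. It is not my exact code: poll_event, status_query_fn, fake_event and fake_query are names made up for this post, and the status query is passed in as a callback so the sketch builds without the CL headers. In the real program the callback would wrap clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS, …).

```c
#define _POSIX_C_SOURCE 199309L  /* for nanosleep() */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns the current execution status of one
   event. In the real program it would wrap
   clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS, ...). */
typedef int (*status_query_fn)(void *event);

/* OpenCL defines CL_COMPLETE as 0; queued/submitted/running are positive and
   errors are negative. Mirrored here so the sketch needs no CL headers. */
#define STATUS_COMPLETE 0

/* Poll until the event is complete, sleeping sleep_ms between checks.
   Returns the number of sleeps performed, or -1 if the status is an error. */
static int poll_event(status_query_fn query, void *event, long sleep_ms)
{
    assert(query != NULL);
    int sleeps = 0;
    for (;;) {
        int status = query(event);
        if (status < 0)
            return -1;                 /* command terminated abnormally */
        if (status == STATUS_COMPLETE)
            return sleeps;             /* done: report how long we waited */
        struct timespec ts;
        ts.tv_sec = sleep_ms / 1000;
        ts.tv_nsec = (sleep_ms % 1000) * 1000000L;
        nanosleep(&ts, NULL);          /* give the CPU back instead of spinning */
        ++sleeps;
    }
}

/* Stand-in for a GPU event so the loop can be tried without a device:
   reports "running" for a fixed number of checks, then "complete". */
struct fake_event { int checks_left; };

static int fake_query(void *event)
{
    struct fake_event *e = event;
    if (e->checks_left > 0) {
        e->checks_left--;
        return 1;                      /* same value as CL_RUNNING */
    }
    return STATUS_COMPLETE;
}
```

With the read-buffer event, the equivalent of poll_event returns after zero sleeps for me; with the last kernel’s event it sleeps many times.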
So,
clEnqueueReadBuffer(cq, …, CL_FALSE, …); is not an asynchronous but a blocking call (!).
How so? How could nVidia implement the OpenCL runtime SO badly?
Maybe I’m missing something?
Any comments, please?