There is a problem in my code: there seems to be no difference between blocking and non-blocking writes or reads.
I have a GTX660M with CUDA driver 5.5 (OpenCL 1.1 support), Windows 8 and Visual Studio 2010 (Student licence).
I use an in-order command queue without events.
In my code I call clEnqueueWriteBuffer/clEnqueueReadBuffer and then clFinish to block. It is compiled in Debug mode, without optimization. The buffer size is 128 MB. Here are the results from Nsight (all times are in us):
At first I tried a blocking write call and a non-blocking read call.
Name                    Start time (us)   Duration (us)
clEnqueueWriteBuffer         35,080.663      26,624.146
clFinish                     61,856.713          57.322
clEnqueueReadBuffer          61,933.235      26,340.690
clFinish                     88,525.812          52.252
It’s clear that there is nearly no difference between the two: both enqueue calls take about 26 ms, and clFinish returns almost immediately.
Here are the results for a non-blocking write and a blocking read:
Name                    Start time (us)   Duration (us)
clEnqueueWriteBuffer         46,787.322      27,275.274
clFinish                     74,257.352         281.714
clEnqueueReadBuffer          74,559.366      26,444.603
clFinish                    101,153.985          44.438

Name                    Start time (us)   Duration (us)
clEnqueueWriteBuffer         45,191.364      26,689.431
clFinish                     72,143.235          63.251
clEnqueueReadBuffer          72,227.834      26,263.277
clFinish                     98,752.183          58.294

Name                    Start time (us)   Duration (us)
clEnqueueWriteBuffer         45,542.527      25,854.486
clFinish                     71,593.159         269.807
clEnqueueReadBuffer          71,885.008      26,803.755
clFinish                     98,843.884          44.737
Here are the timings for the second GPU, from Intel (Intel HD Graphics 4000), running the same program:
Name                    Start time (us)   Duration (us)
clEnqueueWriteBuffer      2,713,699.366       1,820.125
clFinish                  2,715,695.992      29,031.327
clEnqueueReadBuffer       2,744,778.488       1,041.781
clFinish                  2,745,998.439      22,486.324
The question is: is there something wrong in my code (some missing argument or initialization step), or is it a driver feature/bug? Or is it a restriction of the mobile card?
I also see the same behaviour as on the Intel GPU with an ATI card and with an Intel CPU (Core i7).
This is not a major issue, but having 29 ms to do something else is much better than 0.27 ms. I’m not going to write a multithreaded app only to save a few ms.
This is an issue because a copy takes 8 ms on the GTX660M (which means I could run nearly 4 kernels that each do one read+write over the same size in the same time), so I could generate something (for example twiddle factors for an FFT, a z-chirp signal, etc.) at the same time.
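
To put a number on the 29 ms versus 0.27 ms comparison, here is a small C sketch that works out what share of each write the host could have spent on other work. It is only an illustration, not part of my program: the names transfer_timing, overlap_fraction and print_overlap are made up for this post, the durations are copied from the Nsight rows above, and it assumes the host is free only while clFinish is waiting, which is an approximation.

```c
#include <assert.h>
#include <stdio.h>

/* One enqueue + clFinish pair; durations in microseconds,
   copied from the Nsight tables in this post. */
typedef struct {
    const char *device;
    double enqueue_us; /* time spent inside clEnqueueWriteBuffer */
    double finish_us;  /* time spent inside the clFinish that follows */
} transfer_timing;

/* Write rows from the last GTX660M run and from the HD 4000 run. */
static const transfer_timing gtx660m_write = { "GTX660M", 25854.486, 269.807 };
static const transfer_timing hd4000_write  = { "HD 4000", 1820.125, 29031.327 };

/* Assumption: the host can only do other work after the enqueue call
   has returned, so the usable window is approximated by the time
   spent waiting in clFinish. Returns that window as a fraction of
   the whole enqueue + clFinish time. */
double overlap_fraction(transfer_timing t)
{
    assert(t.enqueue_us >= 0.0 && t.finish_us >= 0.0);
    double total_us = t.enqueue_us + t.finish_us;
    return total_us > 0.0 ? t.finish_us / total_us : 0.0;
}

/* Prints the usable window in ms and as a percentage of the total. */
void print_overlap(transfer_timing t)
{
    double total_ms = (t.enqueue_us + t.finish_us) / 1000.0;
    printf("%-8s host free for %.3f ms of %.3f ms (%.1f%%)\n",
           t.device, t.finish_us / 1000.0, total_ms,
           100.0 * overlap_fraction(t));
}
```

Calling print_overlap(gtx660m_write) and print_overlap(hd4000_write) from main gives about 1% for the GTX660M and about 94% for the HD 4000, which is the gap I would like to close.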