There is a problem in my code: there seems to be no difference between blocking and non-blocking writes or reads.
I have a GTX660M with CUDA driver 5.5 (OpenCL 1.1 support), Windows 8 and Visual Studio 2010 (student licence).
I use an in-order command queue without events.
In my code I call clEnqueueWriteBuffer/clEnqueueReadBuffer and then clFinish to block. It is compiled in Debug mode, without optimization. The buffer size is 128 MB. Here are the results from Nsight (all times are in µs):
First I tried a blocking write call and a non-blocking read call:
Name Start time Duration
clEnqueueWriteBuffer 35,080.663 26,624.146
clFinish 61,856.713 57.322
clEnqueueReadBuffer 61,933.235 26,340.690
clFinish 88,525.812 52.252
Clearly there is almost no difference between the two calls.
Here are the results for a non-blocking write and a blocking read:
Name Start time Duration
clEnqueueWriteBuffer 46,787.322 27,275.274
clFinish 74,257.352 281.714
clEnqueueReadBuffer 74,559.366 26,444.603
clFinish 101,153.985 44.438
Both blocking:
Name Start time Duration
clEnqueueWriteBuffer 45,191.364 26,689.431
clFinish 72,143.235 63.251
clEnqueueReadBuffer 72,227.834 26,263.277
clFinish 98,752.183 58.294
Both non-blocking:
Name Start time Duration
clEnqueueWriteBuffer 45,542.527 25,854.486
clFinish 71,593.159 269.807
clEnqueueReadBuffer 71,885.008 26,803.755
clFinish 98,843.884 44.737
Here are the timings for the same program on my second GPU, from Intel (Intel HD Graphics 4000):
Name Start time Duration
clEnqueueWriteBuffer 2,713,699.366 1,820.125
clFinish 2,715,695.992 29,031.327
clEnqueueReadBuffer 2,744,778.488 1,041.781
clFinish 2,745,998.439 22,486.324
The question is: is there something wrong in my code (a missing argument or initialization step), or is this a driver feature/bug? Or is it a restriction of mobile cards?
I also see the same behaviour as on the Intel GPU with an ATI card and with an Intel CPU (Core i7).
This is not a major issue, but having 29 ms to do something else is much better than having 0.27 ms. I’m not going to write a multithreaded app only to save a few ms.
It matters because a copy takes 8 ms on the GTX660M (which means I could run nearly 4 kernels in the time of one read+write of the same size), so I could generate something else (for example twiddle factors for an FFT, a z-chirp signal, etc.) in the meantime.
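To put numbers on the "29 ms versus 0.27 ms" comparison, here is a small C sketch that works out, for one transfer, how much of the wall time falls in clFinish instead of in the enqueue call, plus the effective bandwidth. The helper names overlap_percent and bandwidth_gb_per_s are made up for this illustration and are not part of OpenCL. It assumes the enqueue and clFinish durations from Nsight can simply be added, and that 128 MB means 128 * 1024 * 1024 bytes.

```c
#include <assert.h>

/* Hypothetical helper (not an OpenCL function): percentage of a transfer's
   wall time that is spent in clFinish rather than inside the enqueue call.
   A high value means the enqueue returned early and the host had time to
   do other work; a value near zero means the enqueue itself blocked. */
double overlap_percent(double enqueue_us, double finish_us)
{
    double total_us = enqueue_us + finish_us;
    assert(total_us > 0.0);
    return 100.0 * finish_us / total_us;
}

/* Hypothetical helper: effective bandwidth in GB/s (10^9 bytes per second),
   assuming the transfer is complete once clFinish returns. */
double bandwidth_gb_per_s(double bytes, double enqueue_us, double finish_us)
{
    double total_us = enqueue_us + finish_us;
    assert(total_us > 0.0);
    return bytes / (total_us * 1e-6) / 1e9;
}
```

With the non-blocking write on the GTX660M (25,854.486 µs in the enqueue, 269.807 µs in clFinish), overlap_percent gives roughly 1%. With the write on the Intel HD Graphics 4000 (1,820.125 µs and 29,031.327 µs) it gives roughly 94%. The effective bandwidth comes out similar in both cases, about 5.1 GB/s and 4.4 GB/s, so the difference is in where the host waits and not in how fast the data moves.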