OpenCL read/write calls block

There is a problem in my code: there seems to be no difference between blocking and non-blocking writes or reads.

I have a GTX660M with CUDA driver 5.5 (OpenCL 1.1 support), Windows 8, and Microsoft Visual Studio 2010 (student licence).

I used an in-order command queue without events.
In my code I call clEnqueueWriteBuffer/clEnqueueReadBuffer and then clFinish to block. The code is compiled in Debug mode, without optimization. The buffer size is 128 MB. Here are the results from Nsight (all times are in µs):

First I tried a blocking write call and a non-blocking read call.

Name                    Start time      Duration
clEnqueueWriteBuffer    35,080.663      26,624.146
clFinish                61,856.713      57.322
clEnqueueReadBuffer     61,933.235      26,340.690
clFinish                88,525.812      52.252

It’s clear that there is almost no difference between the two calls.
Here are the results for a non-blocking write and a blocking read:

Name			Start time	Duration
clEnqueueWriteBuffer	46,787.322	27,275.274
clFinish		74,257.352	281.714
clEnqueueReadBuffer	74,559.366	26,444.603
clFinish		101,153.985	44.438

Both blocking:

Name			Start time	Duration
clEnqueueWriteBuffer	45,191.364	26,689.431
clFinish		72,143.235	63.251
clEnqueueReadBuffer	72,227.834	26,263.277
clFinish		98,752.183	58.294

Both non-blocking:

Name			Start time	Duration
clEnqueueWriteBuffer	45,542.527	25,854.486
clFinish		71,593.159	269.807
clEnqueueReadBuffer	71,885.008	26,803.755
clFinish		98,843.884	44.737

Here are the timings for the second GPU, from Intel (Intel HD Graphics 4000), running the same program:

Name			Start time	Duration
clEnqueueWriteBuffer	2,713,699.366	1,820.125
clFinish		2,715,695.992	29,031.327
clEnqueueReadBuffer	2,744,778.488	1,041.781
clFinish		2,745,998.439	22,486.324

The question is: is there something wrong in my code (a missing argument or initialization step), or is it a driver feature/bug? Or is it a restriction of mobile cards?

I also noticed the same behaviour as on the Intel GPU with ATI and with an Intel CPU (Core i7).
This is not a major issue, but having 29 ms to do something else is much better than 0.27 ms. I’m not going to write a multithreaded app just to save a few ms.
It matters because a copy takes 8 ms on the GTX660M (which means that I could run nearly 4 kernels in the time one read + write of the same size takes), so I could generate something else (for example twiddle factors for an FFT, a z-chirp signal, etc.) at the same time.