Global memory bandwidth profiling?

Why is a float4 copy 3 times faster than a float copy on a GTX in my WebCL tests?
http://www.ibiblio.org/e-notes/webcl/gtx560.html

Evgeny

“Global memory is accessed via 32-, 64-, or 128-byte memory transactions”. With (8×8) thread blocks, each row of 8 threads in a warp reads 8×4 bytes = 32 bytes of contiguous floats, so a float copy uses 32-byte transactions. With (32×8) thread blocks a warp reads 32×4 bytes = 128 bytes in one transaction, the same size as a float4 read with (8×8) thread blocks (8×16 bytes = 128 bytes). On GTX, 128-byte transactions are ~3-4 times faster than 32-byte ones.
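
To make the arithmetic concrete, here is a minimal Python sketch of the transaction sizes discussed above. It is only an illustration, not code from the test page: the helper names `contiguous_bytes` and `rows_per_warp` are hypothetical, and it assumes that threads are numbered row by row inside a block, that each block row touches consecutive addresses, and that the block width is a power of two.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def contiguous_bytes(block_width, elem_bytes, warp_size=WARP_SIZE):
    """Bytes a warp reads from one contiguous row segment.

    Assumption: threads are numbered row by row within the block and
    each block row maps to consecutive memory addresses.
    """
    threads_in_row = min(block_width, warp_size)
    return threads_in_row * elem_bytes


def rows_per_warp(block_width, warp_size=WARP_SIZE):
    """Number of separate row segments that a single warp spans."""
    return max(1, warp_size // block_width)


cases = [
    ("float,  (8x8) block", 8, 4),    # float is 4 bytes
    ("float,  (32x8) block", 32, 4),
    ("float4, (8x8) block", 8, 16),   # float4 is 16 bytes
]

for label, width, elem in cases:
    print(label, contiguous_bytes(width, elem), "bytes per segment,",
          rows_per_warp(width), "segment(s) per warp")
```

Under these assumptions the float copy with (8×8) blocks gives four 32-byte segments per warp, while the other two cases give 128-byte segments, which matches the transaction sizes compared above.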