Global memory bandwidth profiling?

enot · November 1, 2011, 11:56am

I don’t understand why, in my WebCL tests, a float4 copy is 3 times faster than a float copy on GTX.
http://www.ibiblio.org/e-notes/webcl/gtx560.html

Evgeny

enot · November 14, 2011, 5:44am

“Global memory is accessed via 32-, 64-, or 128-byte memory transactions”. For (8×8) thread blocks, a warp reads 8×4 bytes = 32 bytes in one float transaction, but for (32×8) thread blocks it reads 32×4 bytes = 128 bytes (the same as a float4 transaction for (8×8) thread blocks). On GTX, 128-byte transactions are ~3-4 times faster than 32-byte ones.
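
The arithmetic above can be checked with a small Python sketch. It is only an illustration, under these assumptions: threads in one row of a block read consecutive elements, rows of the block are not adjacent in memory, and a warp has 32 threads. The helper name contiguous_bytes_per_row is hypothetical and is not taken from the WebCL tests.

```python
# Sketch: contiguous bytes read by the threads of one warp that share a block row.
# Assumptions (not from the original tests): each thread reads one element,
# threads in a row read consecutive addresses, and a warp has 32 threads.

WARP_SIZE = 32
FLOAT_BYTES = 4    # one 32-bit float
FLOAT4_BYTES = 16  # four 32-bit floats


def contiguous_bytes_per_row(block_width, elem_bytes, warp_size=WARP_SIZE):
    """Return the size of the contiguous chunk one block row of a warp reads."""
    threads_in_row = min(block_width, warp_size)
    return threads_in_row * elem_bytes


cases = {
    "float, 8x8 block": contiguous_bytes_per_row(8, FLOAT_BYTES),
    "float, 32x8 block": contiguous_bytes_per_row(32, FLOAT_BYTES),
    "float4, 8x8 block": contiguous_bytes_per_row(8, FLOAT4_BYTES),
}

for name, nbytes in cases.items():
    print(f"{name}: {nbytes} bytes")
```

Under these assumptions, the float copy with (8×8) blocks falls into the 32-byte case, while both the (32×8) float copy and the (8×8) float4 copy reach 128 bytes.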