memory transaction size for compute capability 1.2 or later

Hi, all,
From the “NVIDIA CUDA Programming Guide 2.0”, section
“Coalescing on Devices with Compute Capability 1.2 and Higher”:

“Find the memory segment that contains the address requested by the lowest numbered active thread. Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data.”

I have a question here: each half-warp has 16 threads, so if every thread accesses 8-bit data, the total size per half-warp should be 16 * 8 bits = 128 bits = 16 bytes. But the Guide says “32 bytes for 8-bit data”, so it seems half the bandwidth is wasted. Am I understanding this correctly?


Read on. Later the Programming Guide says that the transaction size is reduced to half if only the lower or the upper half of the segment is used. So as long as the threads of a half-warp access consecutive bytes that are properly aligned, no bandwidth is wasted.

I think this is not true for CC 2.x: the transaction size is always 128 bytes, as per Appendix F, Figure F-1, in the “CUDA C Programming Guide Version 4.0”. Also, for 1.x the minimum transaction size is 32 bytes, so a half-warp reading 16 consecutive 8-bit values still uses only half of its transaction.
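
To make the CC 1.2/1.3 rules discussed above concrete, here is a rough host-side sketch in Python (not CUDA, and not anything from the Guide itself). The function name `half_warp_transactions` is made up for illustration. It assumes every access is aligned to its own word size, that the halving step can be applied repeatedly (128 to 64 to 32 bytes), and that 32 bytes is the smallest transaction, as stated in the posts above.

```python
def half_warp_transactions(addresses, word_bytes):
    """Model the CC 1.2/1.3 coalescing steps for one half-warp.

    addresses:  byte address requested by each active thread, in thread order.
    word_bytes: size of each access in bytes (1, 2, 4, 8 or 16).
    Returns a list of (start_address, transaction_bytes) tuples.
    Hypothetical helper for illustration only.
    """
    # Segment size: 32 bytes for 8-bit data, 64 for 16-bit, 128 otherwise.
    segment_bytes = {1: 32, 2: 64}.get(word_bytes, 128)
    pending = list(addresses)
    transactions = []
    while pending:
        # Segment containing the address of the lowest-numbered active thread.
        start = pending[0] // segment_bytes * segment_bytes
        size = segment_bytes
        # All other active threads whose address falls in the same segment.
        served = [a for a in pending if start <= a < start + size]
        lo = min(served)
        hi = max(served) + word_bytes  # one past the last byte touched
        # Halve the transaction while only one half is used (assumed 32-byte floor).
        while size > 32:
            half = size // 2
            if hi <= start + half:      # only the lower half is used
                size = half
            elif lo >= start + half:    # only the upper half is used
                start += half
                size = half
            else:
                break
        transactions.append((start, size))
        # Mark the serviced threads as inactive and repeat.
        pending = [a for a in pending if a not in served]
    return transactions


# 16 threads reading consecutive, aligned elements starting at address 0:
print(half_warp_transactions(list(range(16)), 1))         # [(0, 32)]
print(half_warp_transactions(list(range(0, 32, 2)), 2))   # [(0, 32)]
print(half_warp_transactions(list(range(0, 64, 4)), 4))   # [(0, 64)]
```

Under these assumptions, 16-bit and 32-bit accesses end up with a transaction exactly as large as the data requested (32 and 64 bytes), while 8-bit accesses request 16 bytes but still issue a 32-byte transaction.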