Coalescing on Devices with Compute Capability 1.2

Hi,

In the CUDA Programming Guide 2.0, section 5.1.2.1, under the topic “Coalescing on Devices with Compute Capability 1.2 and Higher”, it says that 128 bytes are transferred in one memory transaction if all threads of a half-warp access 32-bit or 64-bit words. As I understand it, 32 bits * 16 threads (one half-warp) = 512 bits = 128 bytes. But 64-bit words would need 64 * 16 / 4 = 256 bytes per transfer…
Are 64-bit words cast into 32-bit ones?

Thanks in advance.

There are 8 bits in a byte:
(float) 32 bits * 16 / 8 bits/byte = 64 bytes
(float2) 64 bits * 16 / 8 bits/byte = 128 bytes
(float4) 128 bits * 16 / 8 bits/byte = 256 bytes
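
The same arithmetic as a quick check in Python. This is only a sketch: `half_warp_bytes` is a name made up for this post, and it assumes each of the 16 threads in a half-warp reads exactly one word.

```python
HALF_WARP_THREADS = 16  # threads per half-warp on compute capability 1.x
BITS_PER_BYTE = 8


def half_warp_bytes(word_bits):
    """Bytes requested when each thread of a half-warp reads one word."""
    return word_bits * HALF_WARP_THREADS // BITS_PER_BYTE


# Word sizes for the three types discussed in this thread.
for name, bits in [("float", 32), ("float2", 64), ("float4", 128)]:
    print(f"{name}: {half_warp_bytes(bits)} bytes")
```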

So your question still holds, but it should have been: what happens on a float4 read? It seems that float4 reads are not coalesced on compute 1.2 hardware. However, they will only generate two 128-byte transactions, so performance will not suffer (and it doesn’t in my testing).
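
A rough way to see where the two transactions for float4 come from is to count the 128-byte segments a half-warp touches. The Python sketch below is a simplified model, not the hardware's actual procedure: it assumes thread k reads element k of a contiguous array, and `segments_touched` is a hypothetical helper written for this post.

```python
SEGMENT_BYTES = 128     # segment size discussed in this thread
HALF_WARP_THREADS = 16  # threads per half-warp on compute capability 1.x


def segments_touched(word_bytes, base_address=0):
    """Count distinct 128-byte segments hit by one half-warp.

    Assumption: thread k reads word_bytes bytes starting at
    base_address + k * word_bytes (contiguous, one word per thread).
    """
    segments = set()
    for thread in range(HALF_WARP_THREADS):
        first = base_address + thread * word_bytes
        last = first + word_bytes - 1
        segments.add(first // SEGMENT_BYTES)
        segments.add(last // SEGMENT_BYTES)
    return len(segments)


print("float: ", segments_touched(4))
print("float2:", segments_touched(8))
print("float4:", segments_touched(16))
```

Under these assumptions an aligned float or float2 read stays inside one segment, while a float4 read spans two. Passing a `base_address` that is not a multiple of 128 shows how a misaligned float2 read would also spill into a second segment.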