In the CUDA Programming Guide 2.0, under the topic “Coalescing on Devices with Compute Capability 1.2 and Higher”, they say that the memory segment transferred in one memory transaction is 128 bytes if the threads of a half-warp access 32-bit or 64-bit words. But 32 bits * 16 threads (because of a half-warp) = 512 bits = 64 bytes, while 64-bit words would need 64 * 16 / 8 = 128 bytes, so the two cases differ by a factor of two, yet the guide gives the same segment size for both…
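To double-check the byte counts, here is a minimal Python sketch. The name `bytes_requested` is made up for illustration, and it assumes each thread reads exactly one word and that a half-warp is 16 threads. It only counts the bytes the threads ask for, not how the hardware chooses the segment size.

```python
HALF_WARP = 16  # threads per half-warp (assumption: compute capability 1.x)


def bytes_requested(word_bits, threads=HALF_WARP):
    """Bytes requested when each thread reads one word of `word_bits` bits."""
    return word_bits * threads // 8  # 8 bits per byte


for bits in (32, 64):
    print(f"{bits}-bit words: {bytes_requested(bits)} bytes per half-warp")
# 32-bit words: 64 bytes per half-warp
# 64-bit words: 128 bytes per half-warp
```

So, as far as I can tell, a half-warp reading 32-bit words requests 64 bytes and one reading 64-bit words requests 128 bytes.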
Are 64-bit words cast into 32-bit ones?
Thanks in advance.