What is the alignment requirement for “coalesced” access to global memory?
Judging from Figure 5.1 in the Programming Guide, I’d say it must be 16, 32, 64 or 128. Does it depend on sizeof(VAR)? Please enlighten me :)
OK, the Programming Guide says:
"Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes."
It also gives this as the second of three conditions under “Coalescing on Devices with Compute Capability 1.0 and 1.1”:
"All 16 words must lie in the same segment of size equal to the memory transaction size [...]"
If I understand this correctly: when I cudaMalloc, say, 64 bytes of global memory to store 16 floats, the returned address is aligned to at least 256 bytes (so it is also aligned to 32 and 64 bytes). If successive threads of the half-warp then access successive 32-bit words, the whole 64-byte access is performed as one fast transaction instead of slow, serialized accesses. Am I right here?
Yes.
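To make the arithmetic concrete, here is a rough host-side sketch in plain C++ (not device code) of the check discussed above for 32-bit words on compute capability 1.0/1.1. The names `is_aligned`, `word_address` and `half_warp_coalesces` are made up for illustration, and the model assumes thread k of the half-warp reads word `first_index + k`, with a 64-byte segment (16 words of 4 bytes each).

```cpp
#include <cstddef>
#include <cstdint>

// Assumed sizes for the 32-bit-word case on compute capability 1.0/1.1.
constexpr std::size_t kWordBytes = 4;                        // one float
constexpr std::size_t kHalfWarp = 16;                        // threads per half-warp
constexpr std::size_t kSegmentBytes = kWordBytes * kHalfWarp; // 64-byte segment

// Hypothetical helper: true when addr is a multiple of alignment (a power of two).
constexpr bool is_aligned(std::uintptr_t addr, std::size_t alignment) {
    return (addr & (alignment - 1)) == 0;
}

// Hypothetical helper: byte address of the 32-bit word at `index` from `base`.
constexpr std::uintptr_t word_address(std::uintptr_t base, std::size_t index) {
    return base + index * kWordBytes;
}

// Simplified model: thread k reads word first_index + k. The half-warp's
// 16 words must start on a segment boundary and end inside the same segment.
constexpr bool half_warp_coalesces(std::uintptr_t base, std::size_t first_index) {
    return is_aligned(word_address(base, first_index), kSegmentBytes) &&
           word_address(base, first_index) / kSegmentBytes ==
               word_address(base, first_index + kHalfWarp - 1) / kSegmentBytes;
}
```

With a 256-byte-aligned base (as cudaMalloc returns), `half_warp_coalesces(base, 0)` holds, and so does `half_warp_coalesces(base, 16)` for the next half-warp. Shifting the start by a single element, `half_warp_coalesces(base, 1)`, fails in this model because the 16 words then straddle two 64-byte segments.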