Loading global memory into shared memory: alignment?

So I have lots of threads loading 4-byte values from device global memory into shared memory, and I am wondering what alignment I need?

In the separable convolution example document it says:
“Base read/write addresses of the warps of 32 threads also must meet half-warp
alignment requirement in order to be coalesced. If four-byte values are read, then the base
address for the warp must be 64-byte aligned, and threads within the warp must read
sequential 4-byte addresses. If the dataset with apron does not align in this way, then we
must fix it so that it does.”

but in the cuda toolkit documentation it says:

“Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.”

Which is right?
Please help…

The stricter coalescing requirement for half-warps applied to the (now obsolete) Compute 1.x architecture. The code sample you’re quoting probably originated in those old times.

I believe that starting with Compute 2.0 (Fermi), the coalescing requirements were relaxed. Memory accesses within a warp should still be consecutive to coalesce well, but the alignment requirement is no longer that strict.
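As a sketch of what this means in practice (kernel and names are my own, not from the convolution sample): on Fermi and later, the usual pattern of having consecutive threads read consecutive 4-byte elements coalesces well, and `cudaMalloc` returns pointers aligned to at least 256 bytes, so the base address is naturally aligned for you.

```cuda
#include <cuda_runtime.h>

// Each thread stages one 4-byte value from global into shared memory.
// Consecutive threads read consecutive addresses, so each warp's reads
// collapse into as few 32-/64-/128-byte transactions as possible.
__global__ void stageToShared(const float *in, float *out, int n)
{
    extern __shared__ float tile[];            // dynamic shared memory

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n)
        tile[threadIdx.x] = in[idx];           // coalesced global read
    __syncthreads();                           // tile is now populated

    if (idx < n)
        out[idx] = tile[threadIdx.x];          // use the staged data
}

// Launch sketch: shared memory sized to one float per thread.
// stageToShared<<<blocks, threadsPerBlock,
//                 threadsPerBlock * sizeof(float)>>>(d_in, d_out, n);
```

Misaligned or strided access patterns still work correctly on Fermi and later; they just cost extra memory transactions rather than falling back to fully serialized loads as on Compute 1.x.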

Ahhh. That would make sense. Thanks! I’m quite new to CUDA and the varying compute capabilities make it really hard to learn, as there is slightly conflicting advice in different places and you are never quite sure which compute capability they are referring to. Many thanks!