I have many threads loading 4-byte values from device global memory into shared memory, and I am wondering what alignment I need.
The separable convolution example document says:
“Base read/write addresses of the warps of 32 threads also must meet half-warp alignment requirement in order to be coalesced. If four-byte values are read, then the base address for the warp must be 64-byte aligned, and threads within the warp must read sequential 4-byte addresses. If the dataset with apron does not align in this way, then we must fix it so that it does.”
But the CUDA Toolkit documentation says:
“Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.”
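To make the comparison concrete, here is a rough host-side sketch (plain C++, no GPU needed) that counts how many naturally aligned segments a run of consecutive 4-byte reads would touch. The function name `segments_touched` and the segment sizes passed to it are my own assumptions for illustration. It only models contiguous accesses and says nothing about how a particular GPU actually issues its transactions.

```cpp
#include <cstdint>

// Hypothetical helper (not part of any CUDA API): counts the naturally
// aligned segments of `segment_bytes` touched when `threads` consecutive
// threads each read `elem_bytes` bytes, starting at byte address `base`.
// Assumes threads >= 1, elem_bytes >= 1, segment_bytes >= 1, and a
// contiguous access pattern (thread i reads base + i * elem_bytes).
unsigned segments_touched(std::uint64_t base, unsigned threads,
                          unsigned elem_bytes, unsigned segment_bytes) {
    // Index of the segment containing the first byte read.
    std::uint64_t first = base / segment_bytes;
    // Index of the segment containing the last byte read.
    std::uint64_t last =
        (base + static_cast<std::uint64_t>(threads) * elem_bytes - 1) / segment_bytes;
    return static_cast<unsigned>(last - first + 1);
}
```

By this count, a half-warp of 16 threads starting at a 64-byte-aligned address (`segments_touched(0, 16, 4, 64)`) stays within one 64-byte segment, while a full warp of 32 threads starting at an address that is 64-byte aligned but not 128-byte aligned (`segments_touched(64, 32, 4, 128)`) straddles two 128-byte segments.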
Which is right?
Please help…