Loading global memory into shared memory: alignment?

dave4ff8k · December 8, 2017, 2:00pm

I have a lot of threads loading 4-byte values from device global memory into shared memory, and I am wondering what alignment I need.

In the separable convolution example document it says:
“Base read/write addresses of the warps of 32 threads also must meet half-warp
alignment requirement in order to be coalesced. If four-byte values are read, then the base
address for the warp must be 64-byte aligned, and threads within the warp must read
sequential 4-byte addresses. If the dataset with apron does not align in this way, then we
must fix it so that it does.”

But in the CUDA Toolkit documentation it says:

“Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.”

Which is right?
Please help…

cbuchner1 · December 8, 2017, 2:08pm

The stricter coalescing requirement for half-warps applied to the (now obsolete) compute capability 1.x architecture. Maybe the code sample you’re quoting dates from that era.

I believe the coalescing requirements were relaxed with compute capability 2.0 (Fermi). Memory accesses within a warp must still be consecutive, but the alignment requirement is no longer that strict.
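
To make the effect of alignment concrete, here is a minimal host-side C++ sketch (not CUDA device code) that counts how many naturally aligned segments a warp of 32 threads touches when each thread reads one consecutive 4-byte value. It is a simplified model under stated assumptions, not a description of what any particular GPU does: the name `segments_touched` and its parameters are hypothetical, and real behaviour depends on compute capability and cache configuration.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper (not part of any CUDA API): counts the naturally
// aligned segments of `segment_bytes` touched when thread `lane` of a warp
// reads `elem_bytes` bytes at address base + lane * elem_bytes.
// Assumes elem_bytes <= segment_bytes, so one element spans at most
// two segments.
std::size_t segments_touched(std::uint64_t base,
                             std::size_t segment_bytes,
                             std::size_t warp_size = 32,
                             std::size_t elem_bytes = 4) {
    std::set<std::uint64_t> segments;
    for (std::size_t lane = 0; lane < warp_size; ++lane) {
        std::uint64_t first = base + lane * elem_bytes;  // first byte read
        std::uint64_t last = first + elem_bytes - 1;     // last byte read
        // A segment is identified by its index: address / segment size.
        segments.insert(first / segment_bytes);
        segments.insert(last / segment_bytes);
    }
    return segments.size();
}
```

Under this model, `segments_touched(0, 128)` gives 1 and `segments_touched(64, 128)` gives 2: a consecutive warp read whose base is not aligned to the segment size touches one extra segment. With 32-byte segments the same comparison is 4 segments for base address 0 versus 5 for base address 4.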

dave4ff8k · December 8, 2017, 2:09pm

Ahhh. That would make sense. Thanks! I’m quite new to CUDA and the varying compute capabilities make it really hard to learn, as there is slightly conflicting advice in different places and you are never quite sure which compute capability they are referring to. Many thanks!
