Understanding misaligned access patterns

I’ve been trying to understand the inner workings of memory and coalescing. The CUDA programming best practices guide has some great examples of how bandwidth drops when we have a misaligned access pattern, and I’ve been wondering why this happens. Looking into how DRAM works, I understand that two things influence access speed. First, memory arrays are divided into rows and columns, and when we access data inside the same row we skip the precharge and row-loading phases. Second, after a row access some values get stored in the burst buffer, which is faster to read. Looking at the specs for GDDR6, the page size is 1 KB and the burst length is 64 bytes. I’m probably missing something here: where does the 32B alignment rule come from?
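For context, this is roughly what I’m timing. It’s my own reconstruction of the guide’s offset-copy experiment, so the kernel name, buffer sizes and offset sweep are my choices rather than the guide’s exact code:

```
// Sketch of the offset-copy experiment: each thread copies one float, and the
// whole grid is shifted by `offset` elements so that, for non-zero offsets,
// each warp's 128-byte request straddles segment/cache-line boundaries.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void offsetCopy(float *out, const float *in, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;          // 16M floats (~64 MB per buffer)
    const int maxOffset = 64;       // sweep offsets 0..63 elements
    float *in, *out;
    cudaMalloc(&in,  (n + maxOffset) * sizeof(float));
    cudaMalloc(&out, (n + maxOffset) * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 block(256);
    dim3 grid(n / block.x);

    for (int offset = 0; offset < maxOffset; ++offset) {
        offsetCopy<<<grid, block>>>(out, in, offset);   // warm-up launch
        cudaEventRecord(start);
        for (int rep = 0; rep < 100; ++rep)
            offsetCopy<<<grid, block>>>(out, in, offset);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("offset %2d (%3d bytes): %.3f ms\n",
               offset, offset * (int)sizeof(float), ms / 100.f);
    }

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```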

Also, when trying to reproduce the results, I’m seeing that on a 4090 I need 128B alignment to see meaningful differences in execution time for the example kernel.

Running on a T4 in Google Colab reproduces the 32B results.


Did alignment requirements change between architectures?

This is dictated by the segment sizes that the load/store units handling global memory accesses accept. There is more detail here.
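To make the arithmetic concrete, here is a small host-side sketch (my own illustration, assuming 32-byte segments): it counts how many segments a warp’s 128-byte request touches for a given start offset. Any start that is a multiple of 32 bytes stays at the minimum of four segments; anything else spills into a fifth.

```
// Count how many 32-byte segments a single warp touches when its 32 threads
// each load one consecutive 4-byte float, starting at a given byte offset.
#include <cstdio>
#include <cstdint>

int segmentsTouched(uint64_t startByte, uint64_t segmentBytes = 32)
{
    const uint64_t bytesPerWarp = 32 * 4;   // 32 threads x 4-byte loads = 128 B
    uint64_t firstSeg = startByte / segmentBytes;
    uint64_t lastSeg  = (startByte + bytesPerWarp - 1) / segmentBytes;
    return (int)(lastSeg - firstSeg + 1);
}

int main()
{
    const int offsets[] = {0, 4, 16, 32, 64, 100, 128};
    for (int off : offsets)
        printf("start offset %3d B -> %d segments (minimum is 4)\n",
               off, segmentsTouched(off));
    return 0;
}
```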

Each of the 32 bits is put onto a different DRAM chip.

Also, DRAM is not accessed directly; accesses go through the L1 cache and especially the L2 cache first. Caches have to manage concurrent accesses and use so-called tags to store metadata. Introducing alignment requirements simplifies all of that a lot, e.g. there is no overlap between aligned 128-byte blocks. Storing and caching individual bytes or overlapping blocks would be much more work.
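As a toy illustration (entirely my own, with made-up line and set counts): with fixed-size, aligned lines, every byte address maps to exactly one line, and finding the set and tag is plain integer arithmetic on the address.

```
// With aligned 128-byte lines, the cache lookup is just integer arithmetic
// on the address -- there is no overlap between lines to worry about.
#include <cstdio>
#include <cstdint>

const uint64_t kLineBytes = 128;   // assumed line size for this sketch
const uint64_t kNumSets   = 1024;  // made-up number of cache sets

void decompose(uint64_t addr)
{
    uint64_t offset = addr % kLineBytes;   // byte position within its line
    uint64_t line   = addr / kLineBytes;   // which aligned line it falls into
    uint64_t set    = line % kNumSets;     // which set to search
    uint64_t tag    = line / kNumSets;     // what the stored tag must match
    printf("addr 0x%05llx -> line %5llu, set %4llu, tag %llu, offset %3llu\n",
           (unsigned long long)addr, (unsigned long long)line,
           (unsigned long long)set, (unsigned long long)tag,
           (unsigned long long)offset);
}

int main()
{
    decompose(0x1000);   // first byte of an aligned 128-byte line
    decompose(0x107F);   // last byte of the same line: same line, set and tag
    decompose(0x1080);   // one byte further: a different, non-overlapping line
    return 0;
}
```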

Early CUDA architectures had much stricter requirements for performant memory accesses and for how memory accesses were split into transactions (e.g. page 164 of this older programming guide: https://www3.nd.edu/~zxu2/acms60212-40212/CUDA_C_Programming_Guide.pdf).