Understanding misaligned access patterns

I’ve been trying to understand the inner workings of memory and coalescing. The CUDA programming best practices guide has some great examples of how bandwidth drops when we have a misaligned access pattern, and I’ve been wondering why this happens. Looking into how DRAM works, I understand that two things influence access speed. First, memory arrays are divided into rows and columns, and when we access data inside the same row we skip the precharge and row-loading phases. Second, after a row access some values get stored in the burst buffer, which is faster to read. Looking at the specs for GDDR6, the page size is 1 KB and the burst length is 64 bytes. I’m probably missing something here: where does the 32B alignment rule come from?
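For context, this is roughly what I’m timing. It’s my own reconstruction of the guide’s offset-copy experiment, so the kernel name, buffer sizes and offset sweep are my choices rather than the guide’s exact code:

```
// Sketch of the offset-copy experiment: each thread copies one float, and the
// whole grid is shifted by `offset` elements so that, for non-zero offsets,
// each warp's 128-byte request straddles segment/cache-line boundaries.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void offsetCopy(float *out, const float *in, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;          // 16M floats (~64 MB per buffer)
    const int maxOffset = 64;       // sweep offsets 0..63 elements
    float *in, *out;
    cudaMalloc(&in,  (n + maxOffset) * sizeof(float));
    cudaMalloc(&out, (n + maxOffset) * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 block(256);
    dim3 grid(n / block.x);

    for (int offset = 0; offset < maxOffset; ++offset) {
        offsetCopy<<<grid, block>>>(out, in, offset);   // warm-up launch
        cudaEventRecord(start);
        for (int rep = 0; rep < 100; ++rep)
            offsetCopy<<<grid, block>>>(out, in, offset);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("offset %2d (%3d bytes): %.3f ms\n",
               offset, offset * (int)sizeof(float), ms / 100.f);
    }

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```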

Also, when trying to reproduce the results, I’m seeing that on a 4090 I need 128B alignment to see meaningful differences in execution time for the example kernel.

Running on a T4 in Google Colab reproduces the 32B results.


Did alignment requirements change between architectures?

This is dictated by the segment sizes that the load/store units handling global memory accesses accept. There is more detail here.
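To make the arithmetic concrete, here is a small host-side sketch (my own illustration, assuming 32-byte segments): it counts how many segments a warp’s 128-byte request touches for a given start offset. Any start that is a multiple of 32 bytes stays at the minimum of four segments; anything else spills into a fifth.

```
// Count how many 32-byte segments a single warp touches when its 32 threads
// each load one consecutive 4-byte float, starting at a given byte offset.
#include <cstdio>
#include <cstdint>

int segmentsTouched(uint64_t startByte, uint64_t segmentBytes = 32)
{
    const uint64_t bytesPerWarp = 32 * 4;   // 32 threads x 4-byte loads = 128 B
    uint64_t firstSeg = startByte / segmentBytes;
    uint64_t lastSeg  = (startByte + bytesPerWarp - 1) / segmentBytes;
    return (int)(lastSeg - firstSeg + 1);
}

int main()
{
    const int offsets[] = {0, 4, 16, 32, 64, 100, 128};
    for (int off : offsets)
        printf("start offset %3d B -> %d segments (minimum is 4)\n",
               off, segmentsTouched(off));
    return 0;
}
```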

Each of the 32 bits is put onto a different DRAM chip.

Also, DRAM is not accessed directly; accesses go through the L1 cache and especially the L2 cache first. Caches have to manage concurrent accesses and use so-called tags to store metadata. Introducing alignment requirements simplifies all of that a lot, e.g. there is no overlap between aligned 128-byte blocks. Storing and caching individual bytes or overlapping blocks would be much more work.
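As a toy illustration (entirely my own, with made-up line and set counts): with fixed-size, aligned lines, every byte address maps to exactly one line, and finding the set and tag is plain integer arithmetic on the address.

```
// With aligned 128-byte lines, the cache lookup is just integer arithmetic
// on the address -- there is no overlap between lines to worry about.
#include <cstdio>
#include <cstdint>

const uint64_t kLineBytes = 128;   // assumed line size for this sketch
const uint64_t kNumSets   = 1024;  // made-up number of cache sets

void decompose(uint64_t addr)
{
    uint64_t offset = addr % kLineBytes;   // byte position within its line
    uint64_t line   = addr / kLineBytes;   // which aligned line it falls into
    uint64_t set    = line % kNumSets;     // which set to search
    uint64_t tag    = line / kNumSets;     // what the stored tag must match
    printf("addr 0x%05llx -> line %5llu, set %4llu, tag %llu, offset %3llu\n",
           (unsigned long long)addr, (unsigned long long)line,
           (unsigned long long)set, (unsigned long long)tag,
           (unsigned long long)offset);
}

int main()
{
    decompose(0x1000);   // first byte of an aligned 128-byte line
    decompose(0x107F);   // last byte of the same line: same line, set and tag
    decompose(0x1080);   // one byte further: a different, non-overlapping line
    return 0;
}
```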

Early CUDA architectures had much stricter requirements for performant memory accesses and for how memory accesses were split into transactions (e.g. page 164 of this older programming guide: https://www3.nd.edu/~zxu2/acms60212-40212/CUDA_C_Programming_Guide.pdf).