I’ve been trying to understand the inner workings of memory access and coalescing. The CUDA C++ Best Practices Guide has a great example of how bandwidth drops when the access pattern is misaligned, and I’ve been wondering why this happens. Looking into how DRAM works, I understand that two things influence access speed. First, memory arrays are divided into rows and columns, and when we access data inside a row that is already open we skip the precharge and row-activation (row loading) phases. Second, after the row access some values are stored in the burst buffer, which is faster to read. Looking at the specs for GDDR6, the page size is 1 KB and the burst length is 64 bytes. I’m probably missing something here: where does the 32 B alignment rule come from?
Also, when trying to reproduce the results, I’m seeing that on a 4090 I need 128 B alignment to see meaningful differences in execution time for the example kernel.
Running on a T4 in Google Colab reproduces the 32 B results.
Did the alignment requirements change between the architectures?