Understanding Coalescence

Dear all,

 On a K40 (384-bit-wide memory), which of these is faster to access:

  384 bits + 384 bits in consecutive positions

  384 bits + 384 bits in random positions

My question is whether coalescing is useful only because of the wide memory access (384 bits), or whether there are also gains from accessing consecutive 384-bit blocks (with hardware help).


Luis Gonçalves

My understanding is that these are two completely separate things.

The memory subsystem is divided into so-called channels, each 32 or 64 bits wide (depending on the GPU architecture). Each channel is attached to its own part of the memory chips, i.e. if channels are 64 bits wide, the overall memory is 384 bits wide and there are 6 GB of VRAM, then each channel accesses only its own 1 GB of memory.

The memory space is divided into, say, 256-byte chunks, and each successive chunk goes to the next memory channel. This ensures that sequential memory access spreads the work among all memory channels.
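To make the interleaving concrete, here is a minimal sketch of the mapping described above. It assumes plain round-robin interleaving with the 256-byte chunk and 6 channels used as examples in this post; real GPUs may use a different chunk size or hash the address bits, and the names `channel_of` and `channels_touched` are made up for illustration.

```python
CHUNK_BYTES = 256    # interleave granularity (the "say, 256 byte" example above)
NUM_CHANNELS = 6     # 384-bit bus split into six 64-bit channels (assumed)


def channel_of(address):
    """Channel serving a byte address under simple round-robin interleaving."""
    return (address // CHUNK_BYTES) % NUM_CHANNELS


def channels_touched(start, length):
    """Sorted list of channels that serve the byte range [start, start + length)."""
    first_chunk = start // CHUNK_BYTES
    last_chunk = (start + length - 1) // CHUNK_BYTES
    return sorted({chunk % NUM_CHANNELS
                   for chunk in range(first_chunk, last_chunk + 1)})


if __name__ == "__main__":
    # A 48-byte (384-bit) read stays inside one chunk, so one channel serves it.
    print(channels_touched(0, 48))
    # A sequential 1536-byte read covers six chunks and uses every channel.
    print(channels_touched(0, 6 * CHUNK_BYTES))
```

Under these assumptions a single 384-bit read is served by one channel, and only a longer sequential run keeps all six channels busy.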

OTOH, data received from memory goes into the L2 cache, with 32-byte-wide lines, and/or the L1 cache, with 128-byte lines. Every memory operation in your program works with the cache rather than directly with memory. On every GPU cycle, an operation can access only one cache line, so if your data belongs to multiple cache lines, the access will be slower. That is where the coalescing rules come from.
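The cache-line argument can be illustrated by counting how many distinct lines one warp's addresses fall into. This is only a sketch: it assumes a 32-thread warp, the 128-byte and 32-byte line sizes quoted above, and 4-byte aligned words that never straddle a line. The helpers `lines_touched` and `warp_addresses` are hypothetical names, not part of any CUDA API.

```python
WARP_SIZE = 32      # threads per warp
LINE_BYTES = 128    # L1 line size quoted above; pass 32 for the L2 figure


def lines_touched(addresses, line_bytes=LINE_BYTES):
    """Count the distinct cache lines covered by one warp's byte addresses."""
    return len({addr // line_bytes for addr in addresses})


def warp_addresses(base, stride_bytes):
    """Byte address read by each lane: base + lane * stride_bytes."""
    return [base + lane * stride_bytes for lane in range(WARP_SIZE)]


if __name__ == "__main__":
    patterns = [
        ("consecutive 4-byte words, aligned", warp_addresses(0, 4)),
        ("consecutive 4-byte words, offset by 64 bytes", warp_addresses(64, 4)),
        ("one word per 128-byte line", warp_addresses(0, 128)),
    ]
    for name, addrs in patterns:
        print(name, "->", lines_touched(addrs), "line(s)")
```

In this model the aligned consecutive pattern needs a single 128-byte line, the misaligned one needs two, and the strided one needs 32, one per thread.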

I think you did not answer my question. Caches aside, if I have to access data in main device memory, is there some hardware mechanism that makes accessing 384 bits + 384 bits at consecutive aligned addresses faster than accessing 384 bits + 384 bits at aligned but random addresses?


Luis Gonçalves

That is the first part of my answer: the GPU doesn't access data in 384-bit units. Instead, it splits VRAM into, say, 6 parts, each accessed by one 64-bit memory controller (6*64=384). The memory address is mapped in such a way that each block of, say, 256 bytes goes to the next memory controller.

About the speed of random memory access:

In short, you can't make more than ~10^9 random reads per second, even if you read just a single bit at each address. You need to read data in blocks of at least 128-256 bytes to reach the full memory bandwidth.
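A back-of-the-envelope sketch of that arithmetic: if the number of random reads per second is capped, the delivered bandwidth is the cap times the block size, up to the peak. The ~10^9 reads/s cap is the figure from this post. The 288 GB/s peak is an assumption taken from the K40's published theoretical bandwidth, and the achievable figure is lower in practice.

```python
RANDOM_READS_PER_SEC = 10**9        # the ~10^9 figure quoted above
PEAK_BYTES_PER_SEC = 288 * 10**9    # assumed peak: K40 spec-sheet bandwidth


def effective_bandwidth(block_bytes):
    """Bytes per second delivered when each random read fetches block_bytes."""
    return min(RANDOM_READS_PER_SEC * block_bytes, PEAK_BYTES_PER_SEC)


if __name__ == "__main__":
    for block in (4, 32, 128, 256, 512):
        gb_per_sec = effective_bandwidth(block) / 10**9
        print(block, "bytes per read ->", gb_per_sec, "GB/s")
```

With these numbers, random 4-byte reads deliver only a few GB/s, while blocks in the 128-256 byte range get within reach of the peak, which is why small scattered reads waste most of the bandwidth.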