Memory Coalescing


Can anybody help me understand the concept of memory coalescing? What is it, and what are its advantages with respect to CUDA and GPUs?


It is explained in depth in Chapter 5 of the user guide. Unlike a conventional microprocessor, the GPU has no cache in front of its SDRAM, so loads and stores to SDRAM are slow (hundreds of clock cycles per operation). The GPU memory controller has one mechanism to partially offset this high latency: it can read a linear region of memory as a 16-word segment, aligned on a 16-word boundary, in a single operation. This is coalesced memory access. Any kernel that structures its memory access pattern to fit this 16-word mode will run considerably faster than one that must read or write SDRAM one word at a time.
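To make the difference concrete, here is a minimal sketch (my own, not taken from the guide) of two copy kernels. The names copy_coalesced, copy_strided and run_copy_demo are made up for illustration. In the first kernel, a group of 16 consecutive threads touches 16 consecutive words; in the second, neighbouring threads are `stride` words apart, so their accesses cannot be served from a single segment. Both kernels produce correct results; only the access pattern differs.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Coalesced pattern: thread i touches word i, so 16 consecutive threads
// touch 16 consecutive words.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided pattern: neighbouring threads touch words `stride` elements apart,
// so the accesses of 16 consecutive threads are spread over many segments.
// Assumes n * stride fits in an int.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

// Hypothetical host helper: runs both kernels and checks their output.
// Returns true if every expected element was copied correctly.
bool run_copy_demo(int n, int stride)
{
    std::vector<float> h_in(n), h_a(n, -1.0f), h_b(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess
           && cudaMalloc(&d_a, bytes) == cudaSuccess
           && cudaMalloc(&d_b, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
        // Fill the strided output with -1 so untouched words are detectable.
        cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        copy_coalesced<<<blocks, threads>>>(d_in, d_a, n);
        copy_strided<<<blocks, threads>>>(d_in, d_b, n, stride);

        ok = cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) {
            // The strided kernel only writes every stride-th word.
            const float expect_b = (i % stride == 0) ? h_in[i] : -1.0f;
            ok = (h_a[i] == h_in[i]) && (h_b[i] == expect_b);
        }
    }
    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    return ok;
}
```

Calling run_copy_demo(1 << 16, 16) from your own main() should return true on a working device. Timing is deliberately left out; running the two kernels under a profiler is the easiest way to see how much the strided version costs on your hardware.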

Also read the new best practices guide:

It explains coalescing, covers common memory access patterns, and includes benchmark results that show how much performance is lost with various non-coalesced access patterns.


I am new to CUDA development and I am also trying to understand the concept of coalesced memory access. Can you elaborate on the lines quoted above in more detail?

If the threads read consecutive global memory addresses, all of the independent accesses are combined into one. That is a naive description and not exact (there are special cases, etc.), but it gives you the basics.
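A common place where this shows up is a row-major 2D array. The sketch below is only an illustration under that assumption; row_major_index, scale_row_wise and scale_column_wise are hypothetical names. When threadIdx.x runs along a row, neighbouring threads read neighbouring addresses; when it runs down a column, neighbouring threads are `width` words apart and the accesses are not consecutive.

```cuda
#include <cuda_runtime.h>

// Offset of element (row, col) in a row-major matrix with `width` columns.
__host__ __device__ inline int row_major_index(int row, int col, int width)
{
    return row * width + col;
}

// threadIdx.x walks along a row: adjacent threads touch adjacent words,
// which is the consecutive-address pattern described above.
__global__ void scale_row_wise(float* m, int width, int height, float s)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        m[row_major_index(row, col, width)] *= s;
}

// threadIdx.x walks down a column: adjacent threads touch words that are
// `width` elements apart, so their addresses are not consecutive.
__global__ void scale_column_wise(float* m, int width, int height, float s)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        m[row_major_index(row, col, width)] *= s;
}
```

Both kernels compute the same result. The only difference is which thread index is mapped to the fastest-varying dimension of the array, and that choice decides whether neighbouring threads end up on neighbouring addresses.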

Anyway, this topic has been discussed a lot; if you search Google or the forum, you will find extensive documentation.