Accessing same global memory address within warps

phil_ra12345 · October 24, 2018, 1:05pm

Hi all,
Right now I am implementing particle-in-cell codes for CUDA GPUs. I have millions of particles and a much smaller number of grid cells. The particles are sorted according to their position in the grid. When I move these particles I have to apply the corresponding forces from the grid cells, which means that several threads need to access the same memory address to get these force values. So what I would like to know is: are these memory accesses coalesced or not? I would say no, because they do not access consecutive memory addresses but only one. If they are not coalesced, is there a way to avoid uncoalesced memory accesses? I am having a really hard time figuring this out, so any help is appreciated.
Philipp

Robert_Crovella · October 24, 2018, 1:14pm

Memory accesses from separate warps (i.e. from threads not in the same warp) are never coalesced. Memory accesses from the same warp but emanating from different instructions in the instruction stream are never coalesced. However in each case you may still get some benefit from the cache, compared to going to global memory for all accesses.

If threads in the same warp access (read) the same location in the same instruction/clock cycle, the memory subsystem will use a “broadcast” mechanism so that only one read of that location is required, and all threads using that data will receive it in the same transaction. This will not result in additional transactions to service the request from multiple threads in this scenario.

Coalescing is a “grouping” process (look up the English definition of the word “coalesce”) that the memory subsystem uses to group locations together that are close enough to each other that they can be serviced in a single transaction. Rather than using “coalesce” for requests to the same location from the same warp in the same instruction/issue cycle, I believe it is clearer to use the word “broadcast”, since that is how the programming guide describes it. There is only one read transaction in that case, and the results of the read transaction are “broadcast” to any threads that requested it.

This assumes relatively current hardware, e.g. cc 3.0 or newer. Broadcast rules were different for very ancient CUDA devices, e.g. cc 1.x.

Fiepchen · October 24, 2018, 1:18pm

If a warp accesses the same addresses several times, then the memory instruction is coalesced. Internally, NVIDIA GPUs only support gather instructions, which on newer GPUs are implemented like:

while (not all threads in warp are served)
     request cache line l of first thread, which is not served yet;
     for each thread t in warp
          if (t is not served yet && requested address of t is in l)
               serve t with l;

Note that gather instructions are implemented the same way on modern CPUs, if one replaces the misleading CUDA lingo (thread, warp) with the common CPU lingo (SIMD lane, thread). However, the outer while loop may require some sort of replays on older GPUs (Fermi) and CPUs (Knights Landing), while on newer GPUs and CPUs the loop is handled completely by the load/store units.
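The loop above can be sketched as a small host-side C++ model that counts how many cache-line requests one warp-wide load would generate. This is only an illustration of the pseudocode, not a description of the actual hardware: the 128-byte line size, the 32-lane warp, and the names `count_line_requests` and `warp_addresses` are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed for this sketch: 128-byte cache lines and warps of at most 32 lanes.
constexpr std::uint64_t kLineBytes = 128;
constexpr std::size_t kWarpSize = 32;

// Counts the cache-line requests issued by the gather loop for one warp-wide
// load. addrs[i] is the byte address requested by lane i.
std::size_t count_line_requests(const std::vector<std::uint64_t>& addrs) {
    assert(addrs.size() <= kWarpSize);
    std::vector<bool> served(addrs.size(), false);
    std::size_t requests = 0;
    // Scanning lanes in order and skipping served ones always picks the
    // first lane that has not been served yet (the outer "while" loop).
    for (std::size_t first = 0; first < addrs.size(); ++first) {
        if (served[first]) continue;
        const std::uint64_t line = addrs[first] / kLineBytes;
        ++requests;  // request cache line "line"
        // Serve every unserved lane whose address falls into that line.
        for (std::size_t t = first; t < addrs.size(); ++t) {
            if (!served[t] && addrs[t] / kLineBytes == line) served[t] = true;
        }
    }
    return requests;
}

// Hypothetical helper: lane i requests base + i * stride_bytes.
// A stride of 0 means every lane reads the same address.
std::vector<std::uint64_t> warp_addresses(std::uint64_t base,
                                          std::uint64_t stride_bytes) {
    std::vector<std::uint64_t> addrs(kWarpSize);
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        addrs[lane] = base + lane * stride_bytes;
    }
    return addrs;
}
```

Under these assumptions, `count_line_requests(warp_addresses(4096, 0))` (all lanes read the same word) and `count_line_requests(warp_addresses(4096, 4))` (consecutive 4-byte words) both give 1, while `count_line_requests(warp_addresses(4096, 128))` gives 32, because every lane then lands in its own cache line.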

Greg · October 24, 2018, 1:25pm

Starting with Fermi, the L1 cache can handle multiple threads accessing the same address (on a load or store) without serializing. To optimize memory performance, have all threads access as few 32-byte sectors as possible. Consecutive memory addresses are not a requirement for efficient memory accesses.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0

“Each memory request is then broken down into cache line requests that are issued independently. A cache line request is serviced at the throughput of L1 or L2 cache in case of a cache hit, or at the throughput of device memory, otherwise.

“Note that threads can access any words in any order, including the same words.”

For the Kepler architecture, review http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf starting at slide 27.

The L1 caches in Maxwell through Turing are more forgiving than Kepler's. There is a significantly smaller penalty for accessing sectors outside of a 128-byte cache line.

The CUDA profilers can help identify bad memory access patterns at an instruction level.
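To make the "as few 32-byte sectors as possible" rule concrete for the particle-in-cell case from the question, here is a rough host-side C++ sketch that counts the distinct sectors touched when each lane of a warp loads the force value of its particle's cell. It is a counting model only and says nothing about real timings. The force array layout (one 4-byte float per cell), the base address `kForceBase`, and the helper names are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Assumptions for this sketch: 32-byte sectors, 32 lanes per warp, and a
// force array of 4-byte floats starting at a sector-aligned base address.
constexpr std::uint64_t kSectorBytes = 32;
constexpr std::size_t kLanes = 32;
constexpr std::uint64_t kForceBase = 0x1000;

// Counts the distinct 32-byte sectors covered by one warp-wide load in which
// every lane reads access_bytes bytes starting at its address.
std::size_t count_sectors(const std::vector<std::uint64_t>& addrs,
                          std::uint64_t access_bytes = 4) {
    std::set<std::uint64_t> sectors;
    for (std::uint64_t a : addrs) {
        const std::uint64_t first = a / kSectorBytes;
        const std::uint64_t last = (a + access_bytes - 1) / kSectorBytes;
        for (std::uint64_t s = first; s <= last; ++s) sectors.insert(s);
    }
    return sectors.size();
}

// Hypothetical helper: cell index of each lane's particle, first + lane * step.
// step == 0 models a warp whose particles all sit in the same cell.
std::vector<std::uint32_t> cells_with_stride(std::uint32_t first,
                                             std::uint32_t step) {
    std::vector<std::uint32_t> cells(kLanes);
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        cells[lane] = first + static_cast<std::uint32_t>(lane) * step;
    }
    return cells;
}

// Byte address of the force value each lane would load for its cell.
std::vector<std::uint64_t> force_addresses(const std::vector<std::uint32_t>& cells) {
    std::vector<std::uint64_t> addrs;
    addrs.reserve(cells.size());
    for (std::uint32_t c : cells) {
        addrs.push_back(kForceBase + static_cast<std::uint64_t>(c) * 4);
    }
    return addrs;
}
```

In this model, a warp whose particles all share one cell (`cells_with_stride(7, 0)`) touches 1 sector, a warp covering 32 neighbouring cells (`cells_with_stride(0, 1)`) touches 4, and a warp whose particles are 8 cells apart (`cells_with_stride(0, 8)`) touches 32. Because the count is taken over a set, the order in which lanes map to cells does not change the result, which matches the quoted remark that threads can access any words in any order.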

phil_ra12345 · October 24, 2018, 6:07pm

Thx for the fast answers. They really cleared things up.
