Accessing same global memory address within warps

Hi all,
right now I am implementing particle-in-cell codes for CUDA GPUs. So I have millions of particles and a much lower count of grid cells. The particles are sorted according to their position in the grid. When I move these particles I have to apply the corresponding forces from the grid cells, which means that several threads need to access the same memory address to get these force values. So what I would like to know is: are these memory accesses coalesced or not? I would say no, because the threads are not accessing consecutive memory addresses but only one. So if they are not coalesced, is there a way to avoid uncoalesced memory accesses? I am having a really hard time figuring this out, so any help is appreciated.
Philipp

Memory accesses from separate warps (i.e. from threads not in the same warp) are never coalesced. Memory accesses from the same warp but emanating from different instructions in the instruction stream are never coalesced. However, in both cases you may still get some benefit from the cache, compared to going to global memory for all accesses.

If threads in the same warp access (read) the same location in the same instruction/clock cycle, the memory subsystem will use a “broadcast” mechanism so that only one read of that location is required, and all threads using that data will receive it in the same transaction. This will not result in additional transactions to service the request from multiple threads in this scenario.

Coalescing is a “grouping” process (look up the English definition of the word “coalesce”) that the memory subsystem uses to group locations that are close enough to each other to be serviced in a single transaction. Rather than using “coalesce” for requests to the same location from the same warp in the same instruction/issue cycle, I believe it is clearer to use the word “broadcast”, since that is how the programming guide describes it. There is only one read transaction in that case, and the results of the read transaction are “broadcast” to any threads that requested it.

This assumes relatively current hardware, e.g. cc 3.0 or newer. Broadcast rules were different for very old CUDA devices, e.g. cc 1.x.

If a warp accesses the same addresses several times, then the memory instruction is coalesced. Internally, NVIDIA GPUs only support gather instructions, which on newer GPUs are implemented like:

while (not all threads in warp are served)
     request cache line l of first thread, which is not served yet;
     for each thread t in warp
          if (t is not served yet && requested address of t is in l)
               serve t with l;

Note that gather instructions are implemented the same way on modern CPUs, if one replaces the misleading CUDA lingo (thread, warp) with the common CPU lingo (SIMD lane, thread). However, the outer while loop may require some sort of replay on older GPUs (Fermi) and CPUs (Knights Landing), while on newer GPUs and CPUs the loop is handled completely by the load/store units.
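To make the loop above concrete, here is a small host-side C++ model of it. This is only a sketch for counting how many cache line requests one warp-wide load would generate, not a description of the actual hardware. The function name count_line_requests and the default line size of 128 bytes are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: models the "while not all threads are served" loop.
// addr[t] is the byte address requested by thread (lane) t of one warp.
// Returns the number of cache line requests, i.e. outer-loop iterations.
// line_bytes = 128 is an assumed cache line size.
int count_line_requests(const std::vector<std::uint64_t>& addr,
                        std::uint64_t line_bytes = 128) {
    assert(line_bytes > 0);
    std::vector<bool> served(addr.size(), false);
    int requests = 0;
    for (;;) {
        // Find the first thread that is not served yet.
        std::size_t first = 0;
        while (first < addr.size() && served[first]) ++first;
        if (first == addr.size()) break;  // all threads served

        // Request the cache line of that thread.
        const std::uint64_t line = addr[first] / line_bytes;
        ++requests;

        // Serve every unserved thread whose address falls in this line.
        for (std::size_t t = first; t < addr.size(); ++t) {
            if (!served[t] && addr[t] / line_bytes == line) served[t] = true;
        }
    }
    return requests;
}
```

In this model, 32 threads reading the same address need one request, 32 threads reading 32 consecutive 4-byte words inside one aligned 128-byte line also need one request, and 32 threads each reading from a different 128-byte line need 32 requests.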

The L1 cache starting with Fermi can handle multiple threads accessing the same address (on a load or store) without serializing. To optimize memory performance, have all threads of a warp access as few 32-byte sectors as possible. Consecutive memory addresses are not a requirement for efficient memory accesses.
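For the particle-in-cell case from the question, the "as few 32-byte sectors as possible" rule can be checked with a rough host-side C++ estimate. The sketch below counts the distinct sectors touched when each thread of a warp reads the force entry of its particle's cell. The name sectors_touched, the sector-aligned array base, and the 16-byte (float4-like) force element used in the note that follows are all assumptions for the illustration, not details given in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper: cell[t] is the grid cell index read by thread t of
// one warp. Each cell's force entry occupies elem_bytes, and the force
// array is assumed to start on a sector boundary.
// Returns the number of distinct sectors the warp touches.
std::size_t sectors_touched(const std::vector<std::uint32_t>& cell,
                            std::size_t elem_bytes,
                            std::size_t sector_bytes = 32) {
    assert(elem_bytes > 0 && sector_bytes > 0);
    std::set<std::size_t> sectors;
    for (std::uint32_t c : cell) {
        const std::size_t first = c * elem_bytes;         // first byte read
        const std::size_t last = first + elem_bytes - 1;  // last byte read
        for (std::size_t s = first / sector_bytes; s <= last / sector_bytes; ++s) {
            sectors.insert(s);
        }
    }
    return sectors.size();
}
```

With 16-byte force entries, a warp whose 32 particles all sit in the same cell touches a single sector, and a warp split between cells 7 and 8 touches two. A warp whose particles are scattered over cells 0, 100, 200, ... touches 32 sectors, which is the situation that sorting the particles by cell avoids.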

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0

[i]Each memory request is then broken down into cache line requests that are issued independently. A cache line request is serviced at the throughput of L1 or L2 cache in case of a cache hit, or at the throughput of device memory, otherwise.

Note that threads can access any words in any order, including the same words.[/i]

For the Kepler architecture, review http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf starting at slide 27.

The L1 caches in Maxwell through Turing are more forgiving than Kepler's. There is a significantly smaller penalty for accessing sectors outside of a 128-byte cache line.

The CUDA profilers can help identify bad memory access patterns at an instruction level.

Thx for the fast answers. They really cleared things up.