Just a quick question regarding memory coalescing & divergent branches…
Say you have a kernel that loads/stores memory inside a branch, and the access is coalesced assuming the branch doesn't diverge:
some_var = some_gmem[threadIdx.x];
What happens in the case where the branch does diverge?
Does CUDA make the divergent/idle threads perform the gmem load anyway (since reading a gmem address is harmless), so the transfer stays coalesced and bandwidth use is lowered?
Clearly this can't be done for a memory store (writing undefined data to the address is obviously not an option) - but for loads, it seems like a reasonable optimization the hardware could make…
I also realise I can do it myself manually, I’m just thinking out loud here.
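For concreteness, the pattern I'm asking about looks something like this (kernel and buffer names are made up for illustration):

```cuda
// Hypothetical kernel illustrating the question: a normally coalesced
// load sitting inside a potentially divergent branch.
__global__ void divergent_load(const float *some_gmem, float *out, int cutoff)
{
    float some_var = 0.0f;
    // Divergent whenever 0 < cutoff < blockDim.x: some threads in the
    // warp take the branch, the rest sit idle.
    if (threadIdx.x < cutoff)
        some_var = some_gmem[threadIdx.x];  // coalesced when all threads participate
    out[threadIdx.x] = some_var;
}
```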
Good question. You should ALWAYS worry about memory access patterns and ask whether what you're doing is inefficient.
Memory access is by far the most common bottleneck!
Your specific question is whether divergent threads interfere with coalescing.
No, they don’t. There’s no penalty (but also no savings) if some of the threads are suspended; the memory transfer behaves the same even if not all of the values read are used. This is true on both G80 and G200.
The programming guide mentions this and shows one diagram as an example (where a few threads don’t read).
However, digging one level deeper into the memory abstractions, there IS a potential bandwidth/speed savings if the first 16 or last 16 threads of a warp all happen to diverge. Since memory transfers work on half-warps, if all 16 threads at the start or end of the warp are inactive, that half's transfer is skipped entirely. This is sometimes useful to plan for when you’re juggling tight memory use. For example, my raytracer reads nodes which are 16 words in size, so my warp only uses the first 16 threads for loading.