Hi,
thank you for this perspective and better insights into different kinds of synchronization.
I’m a practitioner with years of CUDA experience, but less so from a parallel-CS theoretical point of view.
As far as I understand, there is no way to guarantee that threads within a warp will not diverge, at least not at the C++ or PTX level. And even if there were a guarantee at the SASS level, it would not be a useful one, as SASS is not publicly documented by Nvidia.
Code that relies on the group returned by coalesced_threads() staying converged for some time afterwards (as long as no further conditionals follow) may work in practice, but would be UB in a strict sense.
Nevertheless coalesced_threads() is (maybe) useful in generic code, iterative code or subroutines, where you can use the initially active threads for on-the-fly parallelism, i.e. a parallel algorithm uses those threads and syncs them afterwards:
#include <cooperative_groups.h>
using namespace cooperative_groups;

// generic function ("do" itself is a reserved C++ keyword, so a different name is needed)
__device__ void do_work(int a)
{
    // find out which threads are active and somehow let them cooperate,
    // e.g. cooperatively load data in a coalesced way,
    // and afterwards each thread processes parts of the data
    coalesced_group active = coalesced_threads();
}

__global__ void kernel()
{
    if (threadIdx.x < 16)
        do_work(1);
    else
        do_work(2);
}
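To make that more concrete, here is a rough sketch of my own (the helper name is made up): a generic routine that reduces a per-thread value across whichever threads happen to call it, relying only on the group's own collectives for synchronization (cg::reduce needs CUDA 11 or newer):

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
using namespace cooperative_groups;

// hypothetical generic helper: sums v over exactly the threads that call it
__device__ int sum_over_active_threads(int v)
{
    // whichever threads reach this point form the cooperating group
    coalesced_group active = coalesced_threads();

    // group collective with implicit synchronization of exactly those threads;
    // every member of the group receives the group-wide sum
    return reduce(active, v, plus<int>());
}

Called from both branches of the kernel above, it would sum over the 16 threads of each branch separately.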
I think the official way (and how I currently handle it) is to not rely on non-divergence at all for correctness, but only for performance.
I would not be so sure that the change was made solely for CPU compatibility; it may also have been for internal design reasons/flexibility, or to make it easier to accelerate algorithms with non-optimal source code. E.g. there has been a trend over several recent years to improve the L1 cache and relax the coalescing requirements.
It is a balance between keeping the SMs simple (to dedicate as much die area as possible to computation), delivering data fast enough (area for caches; with Ada Lovelace Nvidia increased the L2 cache by a lot), and optimizing compute usage/occupancy for a wide range of better- or worse-optimized CUDA algorithms.
So again, there is no way to guarantee non-divergence, even locally.
You can assume non-divergence for performance reasons, but wherever program correctness is involved, you insert synchronization instructions.
- Synchronization is implicit in the voting and shuffle intrinsics (their _sync variants).
- Warp-wide synchronization with __syncwarp() is cheap (as long as the warp is not actually diverged anyway) and is only needed around memory operations (shared or global); see the sketch after this list.
=> So there should be no slow-down of the SIMD execution.
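A minimal sketch of both points (my own illustration; it assumes a launch with a single full warp of 32 threads per block, so the full 0xffffffff mask and the per-warp shared buffer are valid):

__global__ void warp_demo(const int* in, int* out)
{
    __shared__ int buf[32];
    unsigned lane = threadIdx.x % 32;
    int v = in[threadIdx.x];

    // vote/shuffle intrinsics: the _sync variants synchronize the named lanes
    // implicitly, no separate __syncwarp() is needed around them
    unsigned positive = __ballot_sync(0xffffffff, v > 0);
    int      shifted  = __shfl_down_sync(0xffffffff, v, 1);

    // communication through shared memory: here the explicit (but cheap)
    // __syncwarp() is what makes the exchange between lanes well-defined
    buf[lane] = v;
    __syncwarp();
    int neighbor = buf[(lane + 1) % 32];

    out[threadIdx.x] = __popc(positive) + shifted + neighbor;
}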
There are only a few scenarios where you would actually need lots of synchronization.
The old warp-wide reduction code using shared memory comes to mind:
unsigned tid = threadIdx.x;
int v = shmem[tid]; // each thread starts from its own element
v += shmem[tid+16]; __syncwarp();
shmem[tid] = v; __syncwarp();
v += shmem[tid+8]; __syncwarp();
shmem[tid] = v; __syncwarp();
v += shmem[tid+4]; __syncwarp();
shmem[tid] = v; __syncwarp();
v += shmem[tid+2]; __syncwarp();
shmem[tid] = v; __syncwarp();
v += shmem[tid+1]; __syncwarp();
shmem[tid] = v;
(source: Listing 8 of "Using CUDA Warp-Level Primitives" on the NVIDIA Technical Blog, found after a quick Google search for an example)
However, exactly this shared-memory code can be nicely replaced by warp shuffle instructions with their implicit synchronization.
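For instance, a sketch of such a shuffle-based warp reduction (my own version, assuming a full warp of 32 active threads):

__device__ int warp_reduce_sum(int v)
{
    // tree reduction over the full warp; each __shfl_down_sync already
    // synchronizes the participating lanes, so neither shared memory
    // nor an extra __syncwarp() is needed
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v; // lane 0 ends up with the warp-wide sum
}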
So the remaining uses of __syncwarp() will be few and far between (e.g. at the end of each iteration of random read/write accesses into shared memory, or after cooperatively loading data from global to shared memory).
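A rough sketch of that last case (names and sizes are made up; it assumes one warp per block so the shared buffer belongs to a single warp):

__device__ void process_tile(const float* global_in, float* out)
{
    __shared__ float tile[32];
    unsigned lane = threadIdx.x % 32;

    // the warp cooperatively loads one tile from global to shared memory
    tile[lane] = global_in[lane];

    // one cheap warp-wide barrier before any lane reads what another lane wrote
    __syncwarp();

    // afterwards each thread processes its part, also touching a neighbour's element
    out[lane] = 0.5f * (tile[lane] + tile[(lane + 1) % 32]);
}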