It’s admittedly difficult to understand in practice. In my view, the way to think about it is not as a practical experiment, trying to work out how a warp might diverge when there is no reason for it to do so. Instead, it’s best to accept it as a principle, because it promotes the right way to think about programming in CUDA: the CUDA programming model does not guarantee lockstep execution, and CUDA code can be written to work correctly without that expectation. As has already been indicated, we are dealing with a difference between the programming model and the implementation. In the implementations we have today, it’s not possible (for me) to propose a situation in which a warp diverges for no reason.
Yes, it’s possible to be divergent at a ballot function, and the __ballot_sync() primitive was introduced to remove the ambiguity or uncertainty that a proper treatment of the programming model would create with the “old” __ballot() intrinsic.
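As an illustration (my own sketch, not from the original discussion; the kernel and array names are made up), the key difference is that the sync form requires the caller to name the participating lanes with a mask, rather than leaving the result to depend on whichever lanes happen to be converged at that point:

__global__ void vote_kernel(const int *data, unsigned *ballots, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // The bounds check is folded into the predicate rather than a branch, so
    // every lane of the warp reaches this call.  The full mask 0xffffffff is
    // an assertion that all 32 lanes participate.
    unsigned ballot = __ballot_sync(0xffffffff, idx < n && data[idx] > 0);
    // With the legacy __ballot(), the result depended on which lanes were
    // converged at the call site; __ballot_sync() removes that uncertainty
    // by requiring the participating lanes to be named explicitly.
    if ((threadIdx.x & 31) == 0)          // lane 0 of each warp records the vote
        ballots[idx / 32] = ballot;
}

The mask is the design point: instead of inferring participation from the (unspecified) convergence state, the programmer states it.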
To interpret this, we would go back to the example I already gave, and posit the opposite, that the warp is not in lockstep. It is exactly the concern you asked about here:
Let’s go back to my example:
To show what the machine would be executing for a single step in the “unrolled warp-synchronous area”:
sdata[tid] += sdata[tid + 1];
It would look something like this:
LD R0, tid // shorthand: tid actually comes from another register or a sequence of special registers
LDS R1, [R0] // load sdata[tid] from shared memory into R1
LD R2, tid+1 // again, shorthand for the actual address computation
LDS R2, [R2] // load sdata[tid+1] from shared memory into R2
FADD R3, R1, R2 // R3 = sdata[tid] + sdata[tid+1]
STS [R0], R3 // store R3 back to sdata[tid] in shared memory
Now, suppose the warp is not perfectly in lockstep. Considering any two threads in the warp, the execution order between them can be anything at all (a fundamental premise of the CUDA programming model). Take thread 0, which reads shared locations 0 and 1, adds them together, and stores the result at location 0, and thread 1, which reads locations 1 and 2, adds them together, and stores the result at location 1. So thread 0 reads location 1, while thread 1 both reads and writes location 1. If the order of execution between thread 0 and thread 1 can be anything, then thread 1 could execute completely before thread 0. In that case, the addition of locations 1 and 2, and the store to location 1, have already happened by the time thread 0 begins to execute, so thread 0 reads a value from location 1 that is potentially not the same as what it would have read in the lockstep case, or in other orderings we could posit (such as thread 0 executing completely before thread 1, rather than completely after it). In that ordering, the write to location 1 (by thread 1) happens before the read of location 1 (by thread 0), which is almost certainly not what anyone writing a sweep-style parallel reduction intends.
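To make that concrete, here is a sketch (my own, not part of the original exchange; warp_reduce_64, in, and out are hypothetical names) of the same sweep-style warp reduction written so that it does not depend on lockstep at all. Each step reads the partner value, then uses __syncwarp() to guarantee that every lane has finished reading before any lane overwrites shared memory, and again before the next read:

__global__ void warp_reduce_64(const float *in, float *out)
{
    // Assumes a launch with exactly 32 threads (one warp) reducing 64 floats.
    __shared__ float sdata[64];
    unsigned tid = threadIdx.x;
    sdata[tid]      = in[tid];
    sdata[tid + 32] = in[tid + 32];
    __syncwarp();                      // make both halves visible to every lane

    float v = sdata[tid];
    for (unsigned offset = 32; offset > 0; offset >>= 1) {
        v += sdata[tid + offset];      // read the partner's partial sum
        __syncwarp();                  // every lane reads before any lane writes
        sdata[tid] = v;                // publish the updated partial sum
        __syncwarp();                  // writes are visible before the next read
    }
    if (tid == 0) *out = v;            // lane 0 now holds the sum of all 64 values
}

The ordering that the old warp-synchronous code obtained for free from lockstep execution is here stated explicitly, which is exactly what __syncwarp() and the *_sync primitives exist for.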