I got confused after listening to the Illinois GPU course (lecture 10) and reading section 5.1.1.2 of the programming manual. Since the streaming processors are SIMD processors and share the same instruction unit, all 8 processors should fetch the same instruction and “execute” it synchronously. As far as I can imagine, predication is the only way to achieve this (i.e. to allow branching on a SIMD processor). But the manual says that if the number of instructions is greater than 7 (or 4 in some other cases), the compiler will not choose predication, but will instead serialize the different execution paths.
My question is: how is this “serializing of different execution paths” implemented? How is it different from predication?
You should regard the SIMD units as having a width of 32 threads (the warp size). All the threads in a warp execute together. If you have divergence within such a warp, the threads have to wait for each other.
Separate warps, on the other hand, can follow different execution paths by following branch instructions. That’s why you do need __syncthreads() to synchronize the threads in a block.
Now there is a difference in the actual instructions generated. Predicated instructions execute depending on a logical predicate bit (@%p0 before the line in PTX). Apart from that, CUDA allows ‘normal’ flow control. This blocks part of the threads in the warp at a branch and reactivates them at a ‘join point’.
As you probably guessed, predication is more efficient within a warp, while normal flow control wins when different warps take different paths, or when the entire block takes the same path. This is a hard choice to make in advance, and the compiler tries to make a guess at it.
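To make the difference concrete, here is a hand-written PTX sketch of both forms for a one-instruction if/else (register names and labels are invented for illustration; actual nvcc output will look different):

// Predicated form: both sides are issued to every thread,
// the predicate masks out the inactive side.
setp.gt.s32  %p1, %r_tid, %r_n;   // %p1 = (tid > N)
@%p1  mov.s32 %r1, 1;             // then-side, runs where %p1 is true
@!%p1 mov.s32 %r1, 0;             // else-side, runs where %p1 is false

// Branch form: the branch itself is a predicated instruction.
setp.gt.s32  %p1, %r_tid, %r_n;
@!%p1 bra    ELSE;
      mov.s32 %r1, 1;             // then-side
      bra     JOIN;
ELSE: mov.s32 %r1, 0;             // else-side
JOIN:

In the branch form, a fully uniform warp just jumps over the side it doesn’t need; only a mixed warp has to be serialized by the hardware and reconverged at the join point.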
So there are two ways to implement branching on a SIMD processor: predication and ‘normal’ flow control. I understand the predicate bit and how each generated predicated instruction executes or not depending on whether the bit is true or false. But that is exactly what I initially guessed ‘normal’ flow control on SIMD to be, before I learned that it is in fact the predication approach. So the other approach must be different from predication. How does CUDA magically block a part of the threads at a branch and later reactivate them? (Don’t they also need a test bit, and do exactly what predication does?)
As I don’t understand the other approach, I would guess the two are about equally efficient. For example:
if (threadID > N) {
    // part A takes time T1
    i1;
    i2;
    i3;
}
else {
    // part B takes time T2
    i4;
    i5;
}
If the compiler discovers that part A and part B each have fewer than 4 instructions, it will generate predicated instructions for both A and B, preceded by a predicate assignment.
For threads with threadID > N, while they are executing instructions i1 to i3, the other threads are executing the same instructions, except that they don’t write output or load/store memory. So the total time would be the same as if the then part and the else part of the if statement were executed sequentially (that is, time = T1 + T2).
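For concreteness, this is how I picture the predicated version in PTX (a hand-written sketch; the register names are made up and the add instructions just stand in for whatever i1..i5 compile to):

setp.gt.s32  %p1, %r_tid, %r_n;   // %p1 = (threadID > N)
@%p1  add.s32 %r1, %r1, 1;        // i1: part A, masked by %p1
@%p1  add.s32 %r2, %r2, 2;        // i2
@%p1  add.s32 %r3, %r3, 3;        // i3
@!%p1 add.s32 %r1, %r1, 4;        // i4: part B, masked by !%p1
@!%p1 add.s32 %r2, %r2, 5;        // i5
// every thread steps through all five instructions -> T1 + T2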
For the other approach, when the threads with threadID > N are running, CUDA somehow magically blocks the other threads, and it takes time T1 for the first group to finish. After that, it blocks the running threads and reactivates the originally blocked ones to execute the else part, which takes time T2. So it also takes T1 + T2.
I don’t understand how it is more efficient.
PS: I tried to be concise, but English is not my first language and it is a bit hard for me to express my question clearly.
The key point is: the processor is capable of running more threads than the SIMD width via time slicing, and that is what makes normal branching possible.
In case of the “threadID > N”, some warps completely pass the test and some warps completely fail the test. Those warps are therefore not divergent and simply have their instruction pointer modified. At most one warp diverges. If that happens, the divergent warp is handled in an unspecified but correct way.
In the end, the result within a warp is of course the same. Both normal flow control and predication end up executing part of the threads and blocking the rest.
But predication is more efficient in some cases, because the “magical” way incurs some overhead, while predication is very light and is typically used to mask only a few instructions.
Internally, normal flow control is implemented by predicating a branch instruction :)
Anyway, you should avoid divergence within warps where you can. It seems to be the second-biggest source of performance loss (after non-coalesced global memory access).
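As a toy illustration of keeping warps uniform (the kernel name and the multiple-of-32 assumption are mine, not from this thread):

__global__ void warpUniformBranch(float *out, int n)
{
    // Assumes n is a multiple of 32 (the warp size) and out has
    // one element per launched thread.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Every warp covers 32 consecutive tids, so if n is a multiple
    // of 32 all threads of a warp agree on (tid >= n): no warp
    // diverges, each one follows the branch and pays for one side only.
    if (tid >= n)
        out[tid] = 0.0f;
    else
        out[tid] = out[tid] * 2.0f;
}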
Note also that nvcc will not generate predicated instructions, despite what the guide says. I was schooled on this in another thread. I’ve tried and tried to get some inner ifs in my kernels to become predicated in the PTX, without success. The closest I’ve managed is something like:
if (a < b)
    r = 0.0f;
else
    r = a*b;
which compiles to a selp instruction. That is sort of predicated, but as wumpus pointed out, “@%p0” is real predication. I’ve never seen such an instruction come out of nvcc.
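From memory, the PTX for that pattern looks something like this (a sketch with invented register names, not verbatim compiler output):

setp.lt.f32  %p1, %f_a, %f_b;              // %p1 = (a < b)
mul.f32      %f_ab, %f_a, %f_b;            // a*b is computed unconditionally
selp.f32     %f_r, 0f00000000, %f_ab, %p1; // r = %p1 ? 0.0f : a*b

Both sides are computed and selp merely picks one result, which is why it is only “sort of” predication: nothing is actually masked.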
In a SIMD environment, all branches are executed anyway, whether that is implemented using predicate instructions or by diverging the warp. Is there a reason predicates would result in better performance?
If it’s only three instructions it might be better to use predication, otherwise flow control. The reason is that branches incur extra overhead (at least on most other architectures, so I suppose the G80 is no exception)
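A rough back-of-envelope of issued instructions per warp (the branch-overhead figure is a guess, just to show the shape of the trade-off):

predication:            len(A) + len(B), always
branch, uniform warp:   len(taken side) + ~2 branch instructions
branch, divergent warp: len(A) + len(B) + branch/reconvergence overhead

For bodies of two or three instructions, the branch overhead eats most of what a uniform warp would save, and a divergent warp is strictly worse off branching; with long bodies, a uniform warp saves an entire side by branching.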