Cost of serialization. The cost of warp execution serialization

Hi, All –

I’ve read the relevant parts of the manual and done some searching, but I’m still confused about the exact cost of a warp serializing instructions (NOT memory reads).

When a warp (or half-warp on the 8000 series, right?) serializes an instruction because of a branch in one or more threads that does not affect all threads in the warp, does that mean that the processor basically puts the other 31 threads on hold and executes them one at a time until everything comes back together? Does that mean serialization presents a 31x hit to execution time? When my kernels serialize because of flow control, it has never seemed that bad.

Also, how does CUDA decide when to re-parallelize the execution? I’ve noticed that I don’t have to call __syncthreads() after every branch to make things run fast, but how intelligent is nvcc in that regard?

Thanks for any insights,

Ben Weiss
Oregon State University Graphics Group

Yes, divergent warps do have a cost. Suppose you have a half-warp (16 threads) and there is an ‘if’ statement. Now suppose that only one of the threads actually takes the ‘if’ branch. The other 15 threads will be idle until that one divergent thread is done. After the divergent thread has finished running all the code in the ‘if’ body, the execution paths converge, and all 16 threads run concurrently from that point on. You don’t need __syncthreads() or anything like that for this to happen; it’s all automatic. Nvcc is actually very good at determining these points of convergence, so you really shouldn’t be worrying about it.

Generally it’s best to write your code so that there are no warp-divergent branches. But if you have some warp-divergent branches, it’s not the end of the world. The GPU can process many different warps at the same time, so it’ll still be faster than a CPU ;)

Just try not to have anything like this, and you should be fine:

if (threadIdx.x == 0)
{
    // a whole lot of calculations
}
else if (threadIdx.x == 1)
{
    // a whole lot of calculations
}
else if (threadIdx.x == 2)
{
    // a whole lot of calculations
}
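
If you really do need per-thread cases like that, one common restructuring (a minimal sketch, assuming the work can be split along warp boundaries; warpId and the doCalculations* functions here are hypothetical placeholders) is to branch on a condition that is uniform across each warp, so no warp ever diverges:

// Sketch: every thread in a warp evaluates the condition the same way,
// so each warp takes exactly one path and nothing is serialized.
int warpId = threadIdx.x / warpSize;   // warpSize is 32 on current hardware

if (warpId == 0)
{
    doCalculationsA(threadIdx.x);      // all 32 threads of warp 0 run together
}
else if (warpId == 1)
{
    doCalculationsB(threadIdx.x);      // all 32 threads of warp 1 run together
}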

Thanks! That makes sense; that’s what I figured was probably going on, and it’s certainly still faster than a CPU.

On a hardware level, then, does the device behave like a dynamic 16x SIMD (single instruction, multiple data) architecture? That is, am I actually wasting 15/16ths of the computational power in each branch, or can CUDA put that to work on another block?

Also, can CUDA lump more than one thread into the serialization? If I have a branch that all threads but one enter, does that cost me the same as 15 independent branches?

Thanks!

Nope, it will execute the branch for all 15 threads at the same time.

When there is a condition, all the threads that meet the condition are executed in parallel, and only those that don’t meet the condition are put on hold.
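
To make that concrete, here is a minimal sketch (the variable a and the arithmetic are made up): with an if/else that splits a warp 16/16, the hardware runs each side as a group, not thread by thread:

if (threadIdx.x < 16)
{
    a = a * 2;   // threads 0-15 execute this together while 16-31 are masked off
}
else
{
    a = a + 7;   // then threads 16-31 execute this together
}
// the two paths reconverge here and all 32 threads continue in lockstep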

Manu

Doesn’t divergence happen on WARPS and not half-warps?
So if you have one divergent thread with an if() statement, then the 31 other threads will be disabled, not 15.

This is subtle, since instructions are computed on half-warps in parallel. But the instruction pointer and thread scheduler work on the level of warps.
See the reference manual section 3.1.

Basically: warps = instruction scheduling granularity;
half-warps = execution granularity.

Or is there an optimization where fully-disabled half-warps are not scheduled?
That seems likely, thinking about it, but has anyone measured it?

If so, then indeed, having one active thread would delay 31 other threads in the warp, but waste the throughput of only 15 half-warp co-threads.

Similarly, having one thread make a single device memory call would delay all 32 threads.
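
One way to measure that would be a micro-benchmark along these lines (an untested sketch; the kernel name and the loop body are made up), timing a branch confined to one half-warp against the same branch spread across both half-warps:

__global__ void divergenceTimer(int *out, clock_t *cycles)
{
    clock_t start = clock();          // per-multiprocessor cycle counter
    int x = threadIdx.x;
    if (threadIdx.x < 16)             // only the first half-warp takes this branch
    {
        for (int i = 0; i < 1000; ++i)
            x = x * 3 + 1;
    }
    clock_t stop = clock();
    out[threadIdx.x] = x;             // keep the loop from being optimized away
    if (threadIdx.x == 0)
        *cycles = stop - start;
}

// launch with a single warp: divergenceTimer<<<1, 32>>>(d_out, d_cycles);
// then change the condition to ((threadIdx.x & 1) == 0) and compare the cycle
// counts: if fully-disabled half-warps are skipped, the first version should
// be noticeably cheaper.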

I think the programming guide states that, currently, half-warps are executed concurrently, but that programs should be written to have as few divergent warps as possible because future devices may execute all 32 threads concurrently. So on the very first devices like the 8800, I think a program will only take a hit if a half-warp is divergent. I’m not sure about the GT280 though.