Whacking the CUDA performance

Is this how you can screw up your CUDA application's performance?

My question is: in a CUDA app, will a small divergence that is immediately followed by convergence trash the performance of the whole application? Consider the following pseudo code.

// launched as wacko<<<1, 8>>>(), i.e. grid size 1, block size 8
__global__ void wacko(void)
{
    // assuming there are 8 threads
    // and the thread IDs are 1, 2, ..., 8
    int threadID = threadIdx.x + 1;

    int InverseOfID;

    switch (threadID)
    {
    case 1: InverseOfID = 8; break;
    case 2: InverseOfID = 7; break;
    // ... cases 3 through 7 ...
    case 8: InverseOfID = 1; break;
    }

    // very lengthy math calculation goes here
}

When this CUDA function is called with a block size of 8 and a grid size of 1, I assume CUDA is going to create all 8 threads on a single multiprocessor. Since the switch block in the code serializes the threads, even though the remaining code can be executed in lock step, is it correct to say the performance of this code is reduced to 1/8? Also, will adding a synchronize call before 'very lengthy math calculation goes here' fix the performance issue? I would really like to hear your inputs and comments.

Thank you.

Yes, the 8 threads will execute on one multiprocessor since they belong to the same block and they will be part of a single warp.

This warp will diverge when executing the switch statement, but then reconverge to execute the remaining code in lock-step fashion. There’s no need to add __syncthreads() after the switch statement for this to happen.

Cyril

Also note that you can probably do better than a switch for that. You could either use a small lookup table or, if the computation of "inverseThreadID" is analytical, use a function:

    __device__ int inverseThreadID()
    {
        return blockDim.x - threadIdx.x;
    }
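For comparison, the small-lookup-table variant might look something like this (a sketch only; the table name is made up, and it assumes the fixed 8-thread mapping from the example above):

    // one entry per thread; constant memory is broadcast efficiently
    // when all threads of a warp read nearby entries
    __constant__ int inverseTable[8] = { 8, 7, 6, 5, 4, 3, 2, 1 };

    __device__ int inverseFromTable()
    {
        // each thread reads its own entry; no branch, hence no divergence
        return inverseTable[threadIdx.x];
    }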

Then there will be no divergence.

Mark

Cyril,

Thank you. I am just curious: when the code re-converges, all threads will execute at the same time again. Is that decision made at the hardware level, or does the CUDA compiler actually emit code to give the G80 some hints about the nature of the code path? The reason I brought this up is that my app has a lot of very small fragments of divergent code between lengthy converged sections. I want to make sure these small divergent fragments don't screw up the performance. Another example from our code would look like this:

__device__ void runa()
{
    ...
    runb();
}

__device__ void runb()
{
    ...
}

__global__ void start(void)
{
    if (a == B)
        runa();
    else
        runb();
}

In this case, I guess the threads will converge at the starting point of runb().

Mark,

Thank you for your reply, and sorry for the confusion. I just made that up because I thought it would be easier to get my point across.

Yes, I figured as much, I just wanted to use the opportunity to educate anyone who may be reading and might not know the difference. :)

The threads in your new example will converge after the return from runb(). The compiler provides information in the code to help the hardware know when threads may diverge and converge.
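To picture where that reconvergence point falls, here is the same skeleton with the convergence point marked (annotation only):

    __global__ void start(void)
    {
        if (a == B)
            runa();   // calls runb() internally
        else
            runb();
        // <-- the threads of the warp reconverge here, after both
        //     branches have returned
    }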

Mark

I see. We have to be very careful when writing the code. Will calling __syncthreads() at the beginning of runb() force convergence?

Still don't know what happens with this code:

    // warp index within the block (threadIdx.x & 31 would be the
    // lane within a warp)
    const unsigned int warpID = threadIdx.x >> 5;

    if (warpID & 1) doThis();
    else doThat();

Is there an instruction pointer per warp (with interleaved execution of the different code sections), or is this serialised (in hardware or by the compiler) with the even/odd warps masked out (so they don't hurt)?

Also, while at first sight this difference may seem irrelevant, I wonder if there could be some synchronisation/communication between doThis()/doThat() via shared memory.

Obviously this is only possible if they run interleaved.

It would be absolutely wonderful if serialization only applied to groups of threads that go down different execution paths. In other words, threads taking the same execution path after the decision is made would still execute in lock step.

EDIT: Taken directly from the CUDA doc, page 29, section 6.1.1.2:

When the condition that controls an if or a switch instruction is likely to produce many divergent warps, the compiler may choose to replace the instruction with branch predication to avoid the overhead associated with conditional branching.

When using branch predication, none of the instructions whose execution depends on the controlling condition gets skipped. Instead, each of them is associated with a per-thread condition code or predicate that is set to true or false based on the controlling condition, and although each of these instructions gets scheduled for execution, only the instructions with a true predicate are actually executed. Instructions with a false predicate do not write results, and also do not evaluate addresses or read operands.

The compiler replaces a branch instruction with predicated instructions only if the number of instructions controlled by the branch condition is less than or equal to a certain threshold: if the compiler determines that the condition is likely to produce many divergent warps, this threshold is 7, otherwise it is 4.
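To make that concrete, a short branch like the following (illustrative, with made-up names) is the kind of candidate the compiler can predicate, since each path is a single instruction:

    __device__ int shortBranch(int a, int b)
    {
        int x;
        // both assignments can be issued with per-thread predicates,
        // so no actual branch (and no divergence) is needed
        if (threadIdx.x & 1)
            x = a * b;
        else
            x = a + b;
        return x;
    }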

My question is: can we change that number? For Knaxkopp's case, I don't think branch predication will be used, since the compiler doesn't know how big the doThis() and doThat() functions are.

You don’t need to force convergence, it will happen automatically. Putting a __syncthreads() in runb() would be illegal, since it’s inside a divergent branch (it will hang your machine).
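In other words, this is the dangerous pattern (sketch):

    __device__ void runb()
    {
        // ...
        __syncthreads();  // ILLEGAL here if runb() is reached through a
                          // divergent branch: some threads never arrive
                          // at the barrier, so the GPU hangs
    }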

Mark

Is this detected by the compiler, or do we just have to be aware of it?

They are serialized by the hardware. As long as the granularity of the divergence is a multiple of the warp size, they “don’t hurt”.
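For example, contrast the two granularities (illustrative):

    // Intra-warp divergence: odd and even lanes of the SAME warp take
    // different paths, so the hardware serializes the two paths.
    if (threadIdx.x & 1) doThis();
    else doThat();

    // Warp-granularity divergence: all 32 threads of each warp take
    // the same path, so every warp executes only one side of the branch.
    if ((threadIdx.x >> 5) & 1) doThis();
    else doThat();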

Communication between threads via shared memory requires __syncthreads(). Since in this case it would require a __syncthreads() inside a divergent branch, this is illegal (it would hang the GPU).

Mark

I took a look at Knaxkopp's code again. If the number of instructions behind the call is within the threshold, the compiler might predicate the instructions. Then the threads that satisfy the condition would form one warp and the other threads would form another warp. Then again, the compiler might decide two warps are not enough to justify instruction predication, and instead have the multiprocessor execute each thread sequentially. Mark, what is your opinion on this?

Threads do not move around between warps. Threads 0-31 are warp 0, threads 32-63 are warp 1, etc. The hardware can’t gather threads from different branches into a new warp.
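That is, warp membership is fixed by the thread index alone (assuming a 1D block):

    unsigned int warp = threadIdx.x >> 5;  // which warp within the block (0, 1, ...)
    unsigned int lane = threadIdx.x & 31;  // which lane within that warp (0..31)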

Mark

Is there a recommended way to “idle” a thread?

The application is working backward through a tree. At first, many threads operate on the leaves of the tree. At the next step back in the tree, a portion (let's say half) of the threads work on the nodes at that step, so the remaining threads are idle. Eventually, there will be only one thread working (on the root node of the tree).

To idle the remaining threads, I’m using conditional code:

if (threadId < threadIdActiveMax)
{
    // do some useful stuff on this node in this thread
    ....
}
else
{
    // do nothing in this code block -- it is a null block
}

(Yes, I understand that this is “wasting” computational power, but in a recombining binomial tree, for instance, it’s not so bad.)

Is this the best way to idle threads?

I don’t think you even need the else-clause. Simply choose which thread is active with the if-statement, as you have right now. You’ll have divergence, with all the excluded threads waiting to have eligible instructions to execute (which won’t happen until they converge, or your kernel terminates). So, you’ll get the desired effect.

Paulius

Right – I included it for emphasis.

Thanks!

As long as the divergence is a multiple of the warp size, it will be efficient. When you get down to less than 32 threads, consider unrolling the loop, and if your algorithm can operate correctly without the if at this level, leave it out – that way you have no sub-warp divergence.

This will work for most parallel sum reductions, for example.
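A minimal sketch of that pattern (names are illustrative; it assumes a power-of-two block size of at least 64, and the final warp relies on the whole warp executing in lock step, as discussed above):

    // launched e.g. as:
    // reduceSum<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_out);
    __global__ void reduceSum(const float *g_in, float *g_out)
    {
        extern __shared__ float sdata[];
        volatile float *s = sdata;

        unsigned int tid = threadIdx.x;
        s[tid] = g_in[blockIdx.x * blockDim.x + tid];
        __syncthreads();

        // While more than one warp is active, the if keeps the
        // divergence at warp granularity, so excluded warps simply idle.
        for (unsigned int stride = blockDim.x / 2; stride > 32; stride >>= 1)
        {
            if (tid < stride)
                s[tid] += s[tid + stride];
            __syncthreads();
        }

        // Final warp, unrolled with the if left out: no sub-warp
        // divergence. The extra additions by the upper lanes are
        // harmless because their results are never read.
        if (tid < 32)
        {
            s[tid] += s[tid + 32];
            s[tid] += s[tid + 16];
            s[tid] += s[tid + 8];
            s[tid] += s[tid + 4];
            s[tid] += s[tid + 2];
            s[tid] += s[tid + 1];
        }

        if (tid == 0)
            g_out[blockIdx.x] = s[0];
    }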

Mark