Forced Convergence in Divergent Code Paths

Here’s what I want to do - suppose I have a condition that occurs for certain data inputs, not in any coalesced or even predictable fashion, and when it shows up I want to trigger a full-warp, un-predicated computation.

For instance, let’s say that I have an MT-style RNG with a 64-word-wide state, so that each time I call rand() the warp should update the entire state over two warp-wide operations. That is, the entire warp works together synchronously to update the RNG. Now, imagine that I want to call the RNG from a code path that may be warp-divergent. Then only some of the threads in the warp will do their work, resulting in a half-baked RNG state. What I really want is for the predicate flags to be cleared during the RNG update, no matter what their state was when the RNG was entered.
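To make the setup concrete, here’s roughly the shape I have in mind. This is a sketch of my own with placeholder constants and names (warp_rand, a 64-word shared-memory state), not a real twister - a real one would also cross-couple words between lanes, which only strengthens the point that all 32 lanes have to be active for both warp-wide operations.

__device__ unsigned int warp_rand(unsigned int *state)   // 64 words of shared memory per warp (assumed)
{
    const int lane = threadIdx.x & 31;

    unsigned int lo = state[lane];
    unsigned int hi = state[lane + 32];

    // Warp-wide operation 1: advance the low half of the state.
    state[lane] = lo * 1664525u + hi + 1013904223u;

    // Warp-wide operation 2: advance the high half of the state.
    state[lane + 32] = hi * 22695477u + lo + 1u;

    // If the warp is divergent here, the masked-off lanes never update their
    // words, and the shared state is left half-baked for every later call.
    return state[lane] ^ state[lane + 32];
}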

With the __any() intrinsic, this can be emulated, like so:

if (divergentCondition)
{
    //do preparation work
}
if (__any(divergentCondition))
{
    rand();
}
if (divergentCondition)
{
    //do more work, using the results from rand()
}

Obviously, this is clunky and awkward from a coding perspective, especially if there are nested conditions (see the sketch below)! The compiler, however, should have no trouble dicing the code up this way to achieve the same result, provided all variables used live in the last scope that was fully convergent.
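For instance, with just one level of nesting the workaround already forces the inner condition out into the convergent scope; the names below are made up purely for illustration:

bool innerCondition = false;                      // has to live in the convergent scope
if (outerCondition)
{
    //do outer preparation work
    innerCondition = preparedValue > threshold;   // some per-thread test
}
if (__any(outerCondition && innerCondition))
{
    rand();
}
if (outerCondition && innerCondition)
{
    //do more work, using the results from rand()
}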

With this in mind, I propose two new intrinsics - __uniform and __asuniform.

__uniform would tell the compiler that a given conditional will always evaluate the same way across a warp. This can provide useful information to the optimizer (e.g. in deciding whether to use a conditional branch or to use predication), and is needed to provide a valid scope for __asuniform.

__asuniform tells the compiler to force the marked code block to be executed across the warp, no matter the current state of divergence. This could be done either through the __any() hack, or as a new hardware intrinsic for future hardware.
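As a rough illustration of how the two would fit together - the syntax here is provisional, of course - the earlier example could collapse back into a single natural block, with a __uniform-marked outer condition supplying the convergent scope:

if (__uniform(warpLevelFlag))        // promised to evaluate identically across the warp
{
    if (divergentCondition)
    {
        //do preparation work
        __asuniform
        {
            rand();                  // whole warp executes, whatever the divergence state
        }
        //do more work, using the results from rand()
    }
}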

Another, more powerful way to do this would be through a __vote intrinsic, along with __pushVote and __popVote. Then you could force an arbitrary subset of a warp to execute, like so:

int myVote = __vote(executionCondition);
if (triggerCondition)
{
    __pushVote(myVote);
    //do work with threads meeting executionCondition
    __popVote();
}
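On this reading, __asuniform falls out as a special case: pushing a vote in which every lane said yes drags the whole warp through the block, e.g.

int allLanes = __vote(true);       // all 32 lanes vote yes (assuming this line is reached convergently)
if (divergentCondition)
{
    __pushVote(allLanes);          // force the full warp to execute
    rand();
    __popVote();                   // restore the previous predicate state
}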

I believe someone mentioned how useful __vote() would be for reduction, since it allows instant calculation of indices for the remaining elements of the reduced list, e.g.:

int threadMask = (1 << (threadIdx.x & 31)) - 1;              // mask of all lower-numbered lanes
int index = __popc(__vote(!reductionDropped) & threadMask);  // rank among the surviving lanes

Instant warp-wide reduction!
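Spelled out a little further - still using the proposed __vote, with everything else invented for illustration - each surviving lane can immediately write its element to a compacted, gap-free slot:

__device__ int warp_compact(float value, bool reductionDropped, float *out)
{
    int threadMask = (1 << (threadIdx.x & 31)) - 1;
    int survivors  = __vote(!reductionDropped);          // bitmask of lanes still in the list
    int index      = __popc(survivors & threadMask);     // my rank among the survivors

    if (!reductionDropped)
        out[index] = value;                              // compacted output for this warp

    return __popc(survivors);                            // how many elements remain
}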

The bottom line is that I want access to the predicate stack! ^.^

Here’s a related extract from the PTX ISA documentation:

N.