No conditional statements: ternary operator, bit twiddling, or predicated instructions?

Hello,

I need to update an array of values according to the following rule:

  • All values not equal to the first minimum are updated to the first minimum of the array; the element holding the first minimum is updated to the second minimum.

  • If the overall sign of the array (the XOR of all the elements' sign bits) is equal to an element's sign, the sign of that element's result is flipped; otherwise the same sign is kept.

  • Minima are compared by absolute value.
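
For concreteness, here is a loop-form sketch of the whole rule (my own illustrative summary, not the optimized code; it mirrors the comparisons used in the snippets below rather than a strict absolute-value ordering):

#include <stdint.h>

/* Illustrative reference only: loop form of the update rule stated above. */
void update(int32_t x[], int n)   /* assumes n >= 2 */
{
    uint32_t sigTotal = 0;
    for (int i = 0; i < n; i++)
        sigTotal ^= 0x80000000u & (uint32_t)x[i];   /* overall array sign */

    int32_t min1 = (x[0] < x[1]) ? x[0] : x[1];     /* first minimum so far */
    int32_t min2 = (x[0] < x[1]) ? x[1] : x[0];     /* second minimum so far */
    for (int i = 2; i < n; i++) {
        if (x[i] < min1)      { min2 = min1; min1 = x[i]; }
        else if (x[i] < min2) { min2 = x[i]; }
    }

    for (int i = 0; i < n; i++) {
        uint32_t s = 0x80000000u & (uint32_t)x[i];  /* element sign */
        int32_t v = (x[i] == min1) ? min2 : min1;   /* first-minimum element gets min2 */
        x[i] = ((s ^ sigTotal) == 0) ? -v : v;      /* signs equal: flip the sign */
    }
}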

So, without conditional statements, we have:

/* array of N+1 elements: x0, x1, x2, ..., xN */

// Extract the sign bit of each element
sx0 = 0x80000000 & x0;
sx1 = 0x80000000 & x1;
...
sxN = 0x80000000 & xN;

// Overall array sign (XOR of all sign bits)
sigTotal = sx0 ^ sx1 ^ ... ^ sxN;

// Init: first and second minimum of x0 and x1
min1 = x1 ^ ((x0 ^ x1) & -(x0 < x1));   // min1 = min(x0, x1)
min2 = x0 ^ ((x0 ^ x1) & -(x0 < x1));   // min2 = max(x0, x1)

// Update first and second minimums
former = min1;
min1 = x2 ^ ((min1 ^ x2) & -(min1 < x2));             // min1 = min(min1, x2)
larger = former ^ x2 ^ min1;                          // larger = max(former, x2)
min2 = larger ^ ((min2 ^ larger) & -(min2 < larger)); // min2 = min(min2, larger)

...

former = min1;
min1 = xN ^ ((min1 ^ xN) & -(min1 < xN));
larger = former ^ xN ^ min1;
min2 = larger ^ ((min2 ^ larger) & -(min2 < larger));

// Update
x0 = -((min1 ^ ((min2 ^ min1) & -(x0 == min1))) & -((sx0 ^ sigTotal) == 0)) | ((min1 ^ ((min2 ^ min1) & -(x0 == min1))) & -((sx0 ^ sigTotal) != 0));
x1 = -((min1 ^ ((min2 ^ min1) & -(x1 == min1))) & -((sx1 ^ sigTotal) == 0)) | ((min1 ^ ((min2 ^ min1) & -(x1 == min1))) & -((sx1 ^ sigTotal) != 0));
...
xN = -((min1 ^ ((min2 ^ min1) & -(xN == min1))) & -((sxN ^ sigTotal) == 0)) | ((min1 ^ ((min2 ^ min1) & -(xN == min1))) & -((sxN ^ sigTotal) != 0));
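
In conditional form, each of those update lines is equivalent to the following sketch (v is a temporary I introduce for illustration):

v = (x0 == min1) ? min2 : min1;        // first-minimum element gets min2, everything else min1
x0 = ((sx0 ^ sigTotal) == 0) ? -v : v; // signs equal: flip the sign; otherwise keep it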

Is this more efficient than doing the same with conditional statements? For the update part:

...

// Same as above until Init

// Init
if (x0 < x1) {
    min1 = x0;
    min2 = x1;
}
else {
    min1 = x1;
    min2 = x0;
}

// Update first and second minimums
if (x2 < min1) {
    min2 = min1;
    min1 = x2;
}
else if (x2 < min2)
    min2 = x2;

...

// Update
if (x0 == min1)
    if ((sx0 ^ sigTotal) == 0)
        x0 = -min2;   // signs equal: flip the sign
    else
        x0 = min2;    // signs differ: keep it
else
    if ((sx0 ^ sigTotal) == 0)
        x0 = -min1;
    else
        x0 = min1;

...

Then again, if some if-else chains are replaced by the ternary operator (?:), will it improve? I have looked at the PTX generated for the same conditional statement written as an if-else chain and with the ternary operator, and for the given example the ternary version compiled to fewer instructions.
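
For instance, the minimum-update step above in ternary form would look something like this (a sketch):

former = min1;
min1 = (x2 < min1) ? x2 : min1;                                // new first minimum
min2 = (former == min1) ? ((x2 < min2) ? x2 : min2) : former;  // new second minimum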

Bit twiddling is actually quite fun, but on CPUs some bit-twiddling clamping techniques compile to far more instructions than the naive implementations. On the GPU, I am more concerned with the degree of potential divergence generated by conditional statements, so a few extra instructions might be quicker than thread serialization. However, I am at a loss here: instruction predication is mentioned in the documentation, but direct access to it is denied to the programmer, unlike on other architectures such as Cell, and I do not know exactly what kind of optimizations the compiler will perform in this case.

Of course, I can compile my code with the -keep option and inspect the PTX myself. Never having done so, I am relying on any CUDA developer with the know-how to enlighten me before I dive into PTX code.

Thanks.

Note that “diving into PTX” will not give a reliable picture of what the executed code looks like, because PTXAS (which transforms PTX into machine code) does further if-conversion and predication. You can use cuobjdump to look at the generated machine code and see what actually runs on the hardware.

I have looked at a lot of compiler-generated machine code since the dawn of CUDA, including verifying the proper use of predication at the machine code level. As the compiler is pretty mature at this point, my recommendation is to simply write CUDA code in the most natural way, i.e. using if-statements, and let the compiler worry about the rest. My observation is that the compiler makes the correct decision whether to use predication, selection, or branching (or some combinations of those) in the vast majority of cases.

Note that CUDA offers overloaded min() and max() functions that cover just about any scalar type, and these functions map pretty much straight to the appropriate hardware instructions. So I would recommend using these, rather than basing minimum/maximum computations on explicit comparisons.
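
Using those, the tracking of the two smallest values might look like this (a sketch; update_two_mins is an illustrative name, not from the original post):

// Sketch: track the two smallest values with min()/max()
// instead of explicit comparisons
__device__ void update_two_mins(int v, int &min1, int &min2)
{
    int former = min1;
    min1 = min(min1, v);               // new first minimum
    min2 = min(min2, max(former, v));  // new second minimum
}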

I ended up implementing the bit twiddling, and the difference between the two methods is negligible. The code snippets I posted are a small fraction of a more complex kernel and represent the code with the greatest divergence.

It seems that the extra arithmetic operations in the bit twiddling effectively cancel out the gains from removing divergence. Given that the compiler is this mature, isn’t the Best Practices Guide’s high-priority guideline concerning divergence overrated?

I’d hazard a guess that your code is memory bandwidth bound, in which case any halfway reasonable implementation of the calculation will run at the same speed.

Divergence is still an issue to watch out for, but the cases of very local divergence (a few C statements in the body of an if-statement) are not the ones programmers should be worried about. The compiler typically handles these appropriately via predicated execution and/or select instructions, sometimes in conjunction with uniform branches. From looking at quite a bit of compiler-generated machine code, in many such cases the generated code is optimal. You can use cuobjdump to check on the compiler’s work.
