I need to update an array of values according to the following rule:
All values not equal to the first minimum are updated to the first minimum in the array. The element which is the first minimum is updated to the second minimum.
In addition, if the sign of the array value matches the sign of the element, the sign is flipped; if not, the same sign is kept.
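For concreteness, here is a minimal sketch of the per-element update I have in mind, assuming the two minima have already been found by a separate reduction and the minimum is unique (names are illustrative, and the sign handling is omitted):

```cuda
__global__ void update_to_minima(float *a, int n, float min1, float min2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (a[i] == min1) {
            a[i] = min2;  // the first-minimum element becomes the second minimum
        } else {
            a[i] = min1;  // every other element becomes the first minimum
        }
    }
}
```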
Then again, would it help if some if-else chains were replaced by the ternary operator ( ? : )? I have compared the PTX generated from the same conditional statement written as an if-else chain and as a ternary expression, and for the given example the ternary version compiled to fewer instructions.
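For illustration, the two forms in question, using the update above as the example:

```cuda
// if-else chain
if (a[i] == min1) a[i] = min2; else a[i] = min1;

// ternary form of the same statement
a[i] = (a[i] == min1) ? min2 : min1;
```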
Bit twiddling is actually quite fun, but on CPUs some bit-twiddling clamping techniques compile to far more instructions than the naive implementations. On the GPU, I am more concerned with the degree of potential divergence generated by conditional statements; a few extra instructions might well be cheaper than thread serialization. However, I am at a loss here: instruction predication is mentioned in the documentation, but unlike on other architectures such as Cell it is not directly accessible to the programmer, and I do not know exactly what kind of optimizations the compiler will apply in this case.
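As an example of the kind of bit twiddling I mean, the classic branch-free integer min from the well-known “Bit Twiddling Hacks” collection:

```cuda
// Branch-free min: -(a < b) is all-ones when a < b, so the mask keeps
// (a ^ b) and the xor yields a; otherwise the mask is zero and b is returned.
__device__ int min_branchless(int a, int b)
{
    return b ^ ((a ^ b) & -(int)(a < b));
}
```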
Of course I can compile my code with the -keep option and inspect the PTX code myself. Never having done so, I rely on any CUDA developer with the know-how to enlighten me before I dive into PTX code.
Note that “diving into PTX” will not give a reliable picture of what the executed code looks like, because PTXAS (which transforms PTX into machine code) does further if-conversion and predication. You can use cuobjdump to look at the generated machine code and see what actually runs on the hardware.
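For example (the exact invocation may vary by toolkit version):

```
nvcc -keep kernel.cu      # keeps intermediate files, including the PTX
cuobjdump -sass a.out     # disassembles the machine code (SASS) that actually runs
```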
I have looked at a lot of compiler-generated machine code since the dawn of CUDA, including verifying the proper use of predication at the machine code level. As the compiler is pretty mature at this point, my recommendation is to simply write CUDA code in the most natural way, i.e. using if-statements, and let the compiler worry about the rest. My observation is that the compiler makes the correct decision whether to use predication, selection, or branching (or some combinations of those) in the vast majority of cases.
Note that CUDA offers overloaded min() and max() functions that cover just about any scalar type, and these functions map pretty much straight to the appropriate hardware instructions. So I would recommend using these rather than basing minimum/maximum computations on explicit comparisons.
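For instance, a sketch of tracking the two smallest values with min()/max() instead of explicit comparisons (a sequential per-thread loop for clarity; the names are mine, not from the posted code):

```cuda
#include <cfloat>  // FLT_MAX

__device__ void two_minima(const float *a, int n, float *min1, float *min2)
{
    float m1 = FLT_MAX, m2 = FLT_MAX;  // first and second minimum so far
    for (int i = 0; i < n; ++i) {
        float v = a[i];
        m2 = min(m2, max(m1, v));  // old m1 is pushed down when v becomes the new minimum
        m1 = min(m1, v);
    }
    *min1 = m1;
    *min2 = m2;
}
```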
I ended up implementing the bit twiddling, and the difference between the two methods is negligible. The code snippets I posted are a small fraction of a more complex kernel and represent the code with the greatest divergence.
It seems that the extra numerical operations in the bit twiddling effectively cancel out the gain from removing the divergence in the code. Given that the compiler is this mature, isn’t the Best Practices Guide’s high-priority guideline concerning divergence overrated?
I’d hazard a guess that your code is memory bandwidth bound, in which case any halfway reasonable implementation of the calculation will run at the same speed.
Divergence is still an issue to watch out for, but the cases of very local divergence (a few C statements in the body of an if-statement) are not the ones programmers should be worried about. The compiler typically handles these appropriately via predicated execution and/or select instructions, sometimes in conjunction with uniform branches. Having looked at quite a bit of compiler-generated machine code, I find that in many such cases the generated code is optimal. You can use cuobjdump to check on the compiler’s work.
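As an illustration, something like the following is the kind of local divergence the compiler typically handles without an actual branch in the machine code:

```cuda
// Short if-body: a prime candidate for predication or a select
// instruction rather than a real branch.
if (a[i] < 0.0f) {
    a[i] = -a[i];
}
```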