In trying to make non-divergent code blocks, I keep finding myself in a position where a simple single-cycle ternary operation would make many interesting things possible.
result = ( test ) ? valueA : valueB;
even if it had to be designed in as a more restrictive function… something like
result = ternary( test, vA, vB );
I don’t need to branch into an entire block of divergent code or anything that dramatic. I just need a way to pick 1 of 2 values.
Does this exist, and I’m just ignorant of it? If so, someone please enlighten me.
If it doesn’t exist, would there be any consideration in future releases of CUDA or future chipsets?
This operation exists in ptx as the instruction selp. I’ve found that if I give the C compiler a simple enough if statement, it will generate a selp instruction.
Actually, all if statements shorter than 7 instructions or so don’t cause divergence but use something called predication. This is in addition to selp, which actually works as part of the predication system. So you don’t have to bend over backwards to generate selps (you can have predicated moves, multiplies, anything).
Divergence in general is not a big deal, if it’s limited to subsections of code. (You don’t want to put your whole kernel in a divergent switch(), but if a small part of it diverges and executes multiple times, then there’s no additional harm.)
You are quoting what the programming guide says, so I would normally agree with you. But have you ever seen the compiler generate predicated instructions? I haven’t except for the selp call I mentioned. Mind you, the last time I checked for this was with CUDA 1.1.
To the OP: you are somewhat at the compiler’s whim on this one, so I’m afraid it is not a guarantee.
I’ve had good experiences with if conditions like that being turned into selp instructions if both vA and vB were precomputed (actually, I’ve only tried it with vA precomputed and vB=0). If you put the instructions to calculate vB in the if, the compiler may or may not generate the selp.
And “condition” needs to be a really simple test. If you have any ands, ors or nots in there, it may decide to use branching instead of selp (it probably does this to get the boolean short-circuiting correct per C++ semantics). I’ve had some luck precomputing a condition = a & b | c (note the bitwise logic), but I don’t think it worked every time.
A compiler intrinsic would be nice, I agree, but NVIDIA has chosen the design route of implementing only a minimal set of compiler intrinsics, mainly to support the intrinsic hardware floating-point operations, so don’t expect it to get added. If you can come up with a reasonable set of test cases demonstrating that the ptx code doesn’t generate selp instructions where you think it should AND that this adversely affects your performance, NVIDIA will probably acknowledge it as a bug/feature request and take care of it in a future version of CUDA.
Interesting. I must say I never really checked. But the question really is: what is the overhead of a branch? If the overhead is very small, then there’s no reason to use predicates. Predicated branches, after all, are in fact explicitly divergent branches.
There’s something like 50 intrinsics at the end of the Programming Guide, and the majority are really poorly documented. I’m not sure what NVIDIA is thinking.
I have, all the time. Note that I always look at the cubin with decuda, not the ptx (I find the generated ptx code unreadable and often misleading where performance is concerned). It might be that the predicated instructions are introduced in the ptxas step, not by nvcc.
Ah, that explains it then. The programming guide really isn’t lying to us. I guess I’ll have to go disassemble my performance-critical code with decuda now and see where the compiler is or isn’t doing dumb things.