sign() function

Thanks for pointing out signbit. Didn’t know and will use it, of course.
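For reference, here is a minimal sketch of the two variants as I understand them. It is plain C++ so it can be checked on the host; the function names are made up for illustration, and I am assuming the same expressions carry over to a device function without having verified that here.

```cpp
#include <cmath>

// Sign via comparisons: returns -1, 0 or +1.
// Each comparison yields 0 or 1, so the source contains no if/else.
inline int sign_by_compare(float x) {
    return (x > 0.0f) - (x < 0.0f);
}

// Sign via the sign bit only: returns -1 or +1, never 0.
// Note that -0.0f has its sign bit set, so it maps to -1.
inline int sign_by_bit(float x) {
    return std::signbit(x) ? -1 : 1;
}
```

The two differ at zero: the comparison version returns 0 there, while the signbit version only distinguishes negative from non-negative. Whether the ternary ends up as a select or as a jump is up to the compiler.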

Branches (is this from another topic?):

I like them on trees, in data structures and in CPU code, but I dislike them in CUDA, because I think you risk an execution time that is the sum of all branches (with some suitable qualifications). That is divergence. If all threads in a warp take the same branch, it is no problem but no use either, I guess.
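To make the "sum of all branches" worry concrete, here is a toy cost model in plain C++. It is only a sketch of my assumption, not a measurement: the function name, the cycle counts and the rule "a warp pays for a side of the branch if at least one of its threads takes it" are all mine.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;

// Toy model (an assumption): the warp executes a side of the branch
// whenever at least one of its threads takes that side, and the costs add up.
inline int warp_cost(const std::array<bool, kWarpSize>& takes_if,
                     int cost_if, int cost_else) {
    bool any_if = false;
    bool any_else = false;
    for (bool t : takes_if) {
        if (t) {
            any_if = true;
        } else {
            any_else = true;
        }
    }
    return (any_if ? cost_if : 0) + (any_else ? cost_else : 0);
}
```

Under this model a uniform warp pays for one side only, while a single thread going the other way makes the whole warp pay for both.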

Please let me hear your thoughts on this.

Yes, this remark was rather directed at the original post. Some other topics also talk about “branch-free” or “conflict-free” code that ends up with more than twice the number of instructions of each branch, which sounds a bit strange to me.

Maybe there is some confusion with pipelined, superscalar architectures, where a branch misprediction can cause a higher penalty than executing both branches?
