Avoiding branching with an arithmetic trick...

Hi all,

I’m a CUDA newbie, and this is my first post on the forum, so good to meet you all!

My reason for looking at GPU computation is to calculate (and sum over) a large (about 100000x1000) matrix. My problem concerns time series data where each element of this matrix represents an integral over time. Fortunately I can discretise the time interval to make the maths easy. Let’s call this matrix B. I have a further two vectors, I and N, which correspond to event times for the indices of the matrix B. Then:

B[i,j] = p0 * I[j] + p1 * ( min( N[i] , I[j] ) - min( I[i] , I[j] ) )

It would be great to use the GPU to calculate this sum of products. However, I note that conditional statements are likely to undermine this approach, and here I have two. That said, I could write the following:

min( N[i] , I[j] ) - min( I[i] , I[j] ) = ( I[j] -| I[i] ) - ( I[j] -| N[i] )

where the -| operator returns the difference between the operands if the difference is positive, and 0 otherwise - i.e. the result floors out at 0. My question is: does such an operator exist? If so, it would be a great way to avoid introducing a branching construct and destroying the thread parallelisation.

My solution so far is to define (using horribly loose notation):

a -| b = fabs( ( a-b )* !signbit( a-b ) )

This seems very complicated for what should be a simple operation, and I’m not sure whether this is using branching under the hood. Plus, it’s not robust as signbit() is defined in the standard as returning non-zero if the sign is negative. Any advances on this??

Many thanks,


How about (a-b)*(a-b>0) :) ?

What branching are you talking about? If you think ‘min’ has branching, then you are wrong - ‘min’ is an instruction. And in general, if your algorithm has to branch, there is no need to replace it with math; just try to make all threads in a warp take the same path, and leave the rest to the compiler.

put this in code tags and/or disable the emoticons please when writing code.

…meaning I could just use [codebox]( N[i] < I[j] ? N[i] : I[j]) - ( I[i] < I[j] ? I[i] : I[j] )[/codebox] ? I was under the impression that the conditional constituted a branch. It’s a stochastic algorithm, so it’s a little difficult to organise warps into groups without a whole bunch of code I don’t really want to write.



I’m fairly sure that conditional will give you a branch. I’m also fairly sure that min and max won’t give you a branch and are implemented in hardware. The profiler will easily tell you if I’m wrong.

Have you measured the performance difference? In my experience divergent branching isn’t as bad as you might think.

I will update you soon…

This snippet:

a -| b = fabs( ( a-b )* !signbit( a-b ) )

could be implemented with two PTX instructions:

sub.f32       d, a, b;             // d = a - b;
slct.f32.f32  e, d, 0f00000000, d; // e = (d >= 0) ? d : 0;

No branching.

You also can add a “.ftz” modifier (flush subnormal) to either opcode.

I assume NVCC will generate that PTX from this:

const float t = a - b;
return (t >= 0.0f) ? t : 0.0f;

Going back to the starting post in this thread, it seems the poster assumed that the implementation of min() would involve some sort of conditional execution. In fact, NVIDIA GPUs have instructions for computing the minimum and maximum of integer and floating-point operands. Based on that, the original formula should be used as-is, resulting in the most natural and readable code.

Just pointing out that the baseline for the primitive he identified is probably two opcodes.

But his original “min()-min()” code was just fine and probably only 3 instructions (vs. 5).

(I didn’t notice how old this thread was!)

The CUDA math library actually has a function for the “positive difference or zero” functionality, called fdim().


fdimf() generates:

setp.gtu.f32  %p1, %f1, %f2;
sub.f32       %f3, %f1, %f2;
selp.f32      %f4, %f3, 0f00000000, %p1;

and the corresponding SASS:


FSETP.GTU.AND P0, PT, R2, c [0x0] [0x144], PT;
FADD          R2, R2, -c [0x0] [0x144];
SEL           R2, R2, RZ, P0;

Trust but verify! :)

Gist here.

I do not recall why fdimf() is implemented this way, instead of fmaxf(a-b,0.0f) for example. It may have something to do with the correct handling of NaNs. I will check on Monday.

[Later:] The relevant standards require fdim() to return NaN if either of the inputs is NaN. On the other hand, they require fmax() to return the non-NaN operand if only one of the inputs is NaN. Thus fdimf(a,b) cannot be implemented as fmaxf(a-b,0.0f).

Then I guess that requirement also blocks the two-instruction “slct” implementation because of this note in the PTX manual:

If operand c is NaN, the comparison is unordered and operand b is selected.

i.e. slct(NaN,0f,NaN) would always return 0f instead of NaN.
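A quick host-side emulation of the slct selection rule (select the first operand when c >= 0, which is false for NaN because the comparison is unordered) confirms this; the function name is mine:

```c
#include <math.h>

/* Emulates PTX slct.f32.f32: d = (c >= 0) ? a : b.
   When c is NaN, the >= comparison is false, so b is selected,
   matching the note quoted above from the PTX manual. */
static float slct_emul(float a, float b, float c)
{
    return (c >= 0.0f) ? a : b;
}
```

So with d = NaN, slct_emul(d, 0.0f, d) yields 0 rather than NaN, which is why the two-instruction sequence cannot implement fdimf().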

Floating point is mysterious in its ways. :)

In general, IEEE-754 specifies that operations with NaN inputs must return NaN (note that it does not have to be one of the input NaNs). I don’t recall the reasons why the IEEE-754 committee deviated from this when it specified the min / max behavior for the 2008 revision; I have a vague recollection that it has something to do with preserving the maximum amount of information when searching for the minimum or maximum element in an array. The notes of the committee meetings may still be online somewhere.