 # Maximum, branch prediction

Hello,

I’m currently implementing the classic parallel reduction kernel to find the maximum of an array along with its index.

I use fmaxf() to compute the maximum at each step, but since I also want the associated index, I need to perform another test.

I wonder what the best way is to do this without introducing branch divergence.

If I write:

``````
vmax = fmaxf(v1, v2);
idxmax = v1 > v2 ? idx1 : idx2;
``````

is it equivalent to:

``````
vmax = fmaxf(v1, v2);

if (v1 > v2)
    idx = idx1;
else
    idx = idx2;
``````

?

I’ve read a few things about “branch prediction”, but nothing precise. If the previous code is “branch predicted”, why not write the following, without using fmaxf at all:

``````
if (v1 > v2) {
    idx = idx1;
    vmax = v1;
}
else {
    idx = idx2;
    vmax = v2;
}
``````

If it is “branch predicted”, it seems that it would be faster than the version with fmaxf.

What is the “limit” of branch prediction?

What would you advise me to do?

Thank you!

If you use cuobjdump to inspect the generated assembly, you can see that the compiler translates

``````
vmax = fmaxf(v1, v2);
idxmax = v1 > v2 ? idx1 : idx2;
``````

into 4 instructions, equivalent to:

``````
float vmax = fmaxf(v1, v2);
bool pred = v1 > v2;
int idxmax = idx2;
if (pred) {
    idxmax = idx1;
}
``````

However,

``````
if (v1 > v2) {
    idx = idx1;
    vmax = v1;
} else {
    idx = idx2;
    vmax = v2;
}
``````

is translated into 5 instructions, equivalent to:

``````
bool pred = v1 > v2;
vmax = v2;
idxmax = idx2;
if (pred) {
    vmax = v1;
    idxmax = idx1;
}
``````

So the former is better, since it needs one instruction fewer.
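To show how the fmaxf plus conditional-select form could be used in the reduction itself, here is a minimal sketch. The names `MaxIdx`, `combine_max`, `reduce_max_host` and the `HD` macro are hypothetical and do not come from the original kernel. The code is plain C++ that should also compile under nvcc, where `HD` expands to `__host__ __device__` so the same helper can be called from a kernel. The loop is only a host-side emulation of the usual stride-halving pattern, and it assumes the element count is a power of two.

```cpp
#include <math.h>    // fmaxf
#include <stddef.h>  // size_t

// Under nvcc, make the helper callable from device code as well.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical value/index pair used for illustration.
struct MaxIdx {
    float val;
    int   idx;
};

// Combine two candidates with the fmaxf + conditional-select form.
// On a tie, b's index is kept. Caveat: fmaxf ignores a NaN operand,
// while (a.val > b.val) is false for any NaN, so if b.val is NaN the
// returned value comes from a but the returned index comes from b.
HD inline MaxIdx combine_max(MaxIdx a, MaxIdx b)
{
    MaxIdx r;
    r.val = fmaxf(a.val, b.val);
    r.idx = (a.val > b.val) ? a.idx : b.idx;
    return r;
}

// Host-side emulation of a tree reduction over buf[0..n).
// Assumes n is a power of two; buf is overwritten.
inline MaxIdx reduce_max_host(MaxIdx* buf, size_t n)
{
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        // Each iteration of the inner loop plays the role of one thread.
        for (size_t tid = 0; tid < stride; ++tid) {
            buf[tid] = combine_max(buf[tid], buf[tid + stride]);
        }
    }
    return buf[0];
}
```

In a real kernel the inner loop would be replaced by the thread index, with a `__syncthreads()` between strides. The instruction count the compiler actually produces can vary with the compiler version and target architecture, so checking the output with cuobjdump is still the way to confirm it.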

Hello,

Thank you very much! cuobjdump is now my friend