Maximum, branch predication

Hello,

I’m currently implementing the classic parallel reduction kernel to find the maximum of an array along with its index.

I use fmaxf() to compute the maximum at each step but, since I also want the associated index, I need to perform another test.

I am wondering what the best way is to do this without introducing branch divergence.

If I write:

vmax = fmaxf(v1, v2);
idxmax = v1 > v2 ? idx1 : idx2;

is it equivalent to:

vmax = fmaxf(v1, v2);
if (v1 > v2)
	idxmax = idx1;
else
	idxmax = idx2;

?

I’ve read a little about “branch predication”, but nothing precise. If the previous code is predicated, why not write (without using fmaxf at all):

if (v1 > v2) {
	idxmax = idx1;
	vmax = v1;
}
else {
	idxmax = idx2;
	vmax = v2;
}

If this is predicated as well, it seems that it would be faster than the version with fmaxf.

What is the “limit” of branch predication?

What would you advise me to do?

Thank you!

If you use cuobjdump to inspect the generated assembly code, you can see that the compiler translates

vmax = fmaxf(v1, v2);
idxmax = v1 > v2 ? idx1 : idx2;

into 4 instructions, which correspond to:

float vmax = fmaxf(v1, v2);
bool pred = v1 > v2;
int idxmax = idx2;
if (pred) {
	idxmax = idx1;
}

However,

if (v1 > v2) {
	idx = idx1;
	vmax = v1;
} else {
	idx = idx2;
	vmax = v2;
}

is translated into 5 instructions, which correspond to:

bool pred = v1 > v2;
vmax = v2;
idxmax = idx2;
if (pred) {
	vmax = v1;
	idxmax = idx1;
}

So the former is better, since it needs one instruction fewer.
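For reference, here is a minimal sketch of how the fmaxf-plus-select pattern could sit inside a shared-memory block reduction. The names blockArgmax, argmaxOnDevice and BLOCK are made up for this illustration and are not taken from the original kernel. It assumes the input has at least one element and contains no NaNs (with a NaN, fmaxf and the > comparison can disagree about which operand wins), and it skips CUDA error checking for brevity. When several elements share the maximum value, which of their indices is returned is not specified.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical block size for this sketch; must be a power of two.
constexpr int BLOCK = 256;

// Each block reduces its slice of `in` to one (max value, index) pair.
// Assumes the kernel is launched with blockDim.x == BLOCK and n >= 1.
__global__ void blockArgmax(const float* in, int n, float* outVal, int* outIdx)
{
    __shared__ float sVal[BLOCK];
    __shared__ int   sIdx[BLOCK];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Out-of-range threads re-read the last valid element, so every
    // (value, index) pair in shared memory refers to real input data.
    int src = min(gid, n - 1);
    sVal[tid] = in[src];
    sIdx[tid] = src;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            float v1 = sVal[tid], v2 = sVal[tid + s];
            int   i1 = sIdx[tid], i2 = sIdx[tid + s];
            // The pattern discussed above: fmaxf plus a select on the index.
            sVal[tid] = fmaxf(v1, v2);
            sIdx[tid] = (v1 > v2) ? i1 : i2;
        }
        __syncthreads();
    }

    if (tid == 0) {
        outVal[blockIdx.x] = sVal[0];
        outIdx[blockIdx.x] = sIdx[0];
    }
}

// Host helper: returns the index of a maximum of h and stores the maximum
// value in *maxOut. The per-block results are combined on the CPU.
int argmaxOnDevice(const std::vector<float>& h, float* maxOut)
{
    int n = static_cast<int>(h.size());
    int blocks = (n + BLOCK - 1) / BLOCK;

    float* dIn = nullptr;
    float* dVal = nullptr;
    int*   dIdx = nullptr;
    cudaMalloc(&dIn,  n * sizeof(float));
    cudaMalloc(&dVal, blocks * sizeof(float));
    cudaMalloc(&dIdx, blocks * sizeof(int));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockArgmax<<<blocks, BLOCK>>>(dIn, n, dVal, dIdx);

    std::vector<float> vals(blocks);
    std::vector<int>   idxs(blocks);
    cudaMemcpy(vals.data(), dVal, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(idxs.data(), dIdx, blocks * sizeof(int),   cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dVal);
    cudaFree(dIdx);

    // Final pass over the (few) per-block winners.
    int best = 0;
    for (int b = 1; b < blocks; ++b)
        if (vals[b] > vals[best]) best = b;

    *maxOut = vals[best];
    return idxs[best];
}
```

The only branch inside the loop is the usual tid < s test of the reduction itself; the index update is a plain select, so it can be checked with cuobjdump in the same way as the snippets above.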

Hello,

Thank you very much! cuobjdump is now my friend