Problem with __fdividef().

Hi All,

I have written this function:

__device__ void divideFun(float *Array, int size)
{
    for (int i = 0; i < size; ++i)
        Array[i] = (float)i / 256.0f;
}

This function takes X ms.

But if I replace the line

            Array[i] = (float)i / 256.0f;

with

            Array[i] = __fdividef((float)i, 256.0f);

then it takes (X + SOME MORE TIME) ms.

But I read in NVIDIA_CUDA_Programming_Guide_1.1.pdf that "Floating-point division takes 36 clock cycles, but __fdividef(x, y) provides a faster version at 20 clock cycles".

So my question is: why is this so?

I also faced the same problem.
Please help us…

Take a look at the generated code - the first version is probably being optimised to a multiply.

Indeed, since you divide by a compile-time constant (256.0f), the compiler folds the reciprocal (1.0f/256.0f) into a constant and carries out a multiplication instead; no division happens at all. And because 256 is a power of two, 1.0f/256.0f is exactly representable, so the transformation doesn't even change the results. When you use __fdividef(x, y), however, the GPU actually performs the division, which is slower than a multiplication.
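
To see what I mean, here is roughly what the compiler turns your loop into (a hand-written equivalent for illustration, not the actual compiler output; you can confirm on your own build by compiling with nvcc --ptx and looking for mul.f32 where you wrote the division):

// Hand-written equivalent of what the compiler does with (float)i / 256.0f.
// 1.0f/256.0f is exact (256 is a power of two), so the results are identical.
__device__ void divideFunMul(float *Array, int size)
{
    const float inv = 1.0f / 256.0f;   // folded to a constant at compile time
    for (int i = 0; i < size; ++i)
        Array[i] = (float)i * inv;     // one multiply, no division
}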

If instead of this

Array[i] = (float)i / 256.0f;

you use this

Array[i] = (float)i / Array[i];

I expect you'll see a performance drop.
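
With a divisor that isn't known at compile time, the compiler can't do the reciprocal trick, and __fdividef() becomes the faster option. A minimal sketch, assuming the divisors come in through a second (hypothetical) array:

// Divisor is only known at run time, so no constant folding is possible;
// this is the case where __fdividef() should actually beat the '/' operator.
__device__ void divideFunVar(float *Array, const float *Divisors, int size)
{
    for (int i = 0; i < size; ++i)
        Array[i] = __fdividef((float)i, Divisors[i]);  // fast approximate divide
}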

Cheers,

Evghenii

Hi egaburov,

You misunderstood. Actually, what his (my) doubt is…

If I use the code below…

__device__ void divideFun(float *Array, int size)
{
    for (int i = 0; i < size; ++i)
        Array[i] = (float)i / 256.0f;
}

then it takes X ms.

But if I use the code below…

__device__ void divideFun(float *Array, int size)
{
    for (int i = 0; i < size; ++i)
        Array[i] = __fdividef((float)i, 256.0f);
}

then it takes (X + SOME MORE TIME) ms.

What could be the reason for this?

What part don’t you understand?

Your code "Array[i] = (float)i / 256.0f;" is compiled into the equivalent of "Array[i] = (float)i * const_X", which is a multiplication and much faster than the division you force by calling __fdividef().
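
If you want to verify this, dump the PTX and compare the two statements. A sketch; the exact instructions depend on your toolkit version, but I'd expect mul.f32 for the constant division and div.approx.f32 for __fdividef():

// Compile with: nvcc --ptx compare.cu, then inspect the generated .ptx file.
__global__ void compare(float *a, float *b, const float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        a[i] = (float)i / 256.0f;          // expect mul.f32 in the PTX
        b[i] = __fdividef((float)i, d[i]); // expect div.approx.f32
    }
}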

Good catch!