Hi All,
I have written code like this:

__device__ void divideFun(float *Array, int size)
{
    int i = 0;
    for (i = 0; i < size; ++i)
        Array[i] = (float)i / 256.0f;
}
This function takes X ms.
But if I replace the line
Array[i] = (float)i / 256.0f;
with
Array[i] = __fdividef((float)i, 256.0f);
then it takes (X + SOME MORE TIME) ms.
But I read in NVIDIA_CUDA_Programming_Guide_1.1.pdf that "Floating-point division takes 36 clock cycles, but __fdividef(x, y) provides a faster version at 20 clock cycles".
So my question is: why is this so?
I also faced the same problem.
Please help us…
Take a look at the generated code - the first version is probably being optimised to a multiply.
Indeed, since you use a constant value (256.0f), the compiler defines a constant (1.0f/256.0f) and then uses this constant to carry out a multiplication, so no division is carried out as such. When you use __fdividef(x, y), however, the GPU actually carries out the division operation, which is slower than a multiplication.
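In other words, the compiler effectively rewrites the loop body along these lines (a sketch of the transformation, not the actual generated code):

Array[i] = (float)i * 0.00390625f;   // 1.0f / 256.0f, folded at compile time

A plain multiply is cheaper than even the 20-cycle __fdividef, which is why forcing the intrinsic makes the loop slower here.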
If instead of this
Array[i] = (float)i / 256.0f;
you use this
Array[i] = (float)i / Array[i];
then I expect you'll see the performance drop.
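A minimal sketch of what I mean (the same divideFun from your post, just with the divisor changed); because Array[i] is not known at compile time, the compiler has to emit a real division:

__device__ void divideFun(float *Array, int size)
{
    int i = 0;
    for (i = 0; i < size; ++i)
        Array[i] = (float)i / Array[i];   // divisor unknown at compile time, so a true division is emitted
}

With a runtime divisor like this, __fdividef((float)i, Array[i]) should then be the faster variant, matching the 20 vs. 36 clock cycle figures from the guide.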
Cheers,
Evghenii
Hi egaburov,
You misunderstood. Actually, what his (my) doubt is this:
If I use the code below…
__device__ void divideFun(float *Array, int size)
{
    int i = 0;
    for (i = 0; i < size; ++i)
        Array[i] = (float)i / 256.0f;
}
then it takes X ms.
But suppose I use the code below…
__device__ void divideFun(float *Array, int size)
{
    int i = 0;
    for (i = 0; i < size; ++i)
        Array[i] = __fdividef((float)i, 256.0f);
}
then it takes (X + SOME MORE TIME) ms.
What could be the reason for this?
What part don’t you understand?
Your current code "Array[i] = (float)i / 256.0f;" is compiled into "Array[i] = (float)i * const_X", which is a multiplication and much faster than the division you force by calling __fdividef().
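If you want to verify this yourself, here is a minimal self-contained sketch (my own code, not from this thread: the kernel names, the cudaEvent timing, and the launch configuration are all illustrative assumptions); it rewrites the loop as one thread per element and times both variants:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void divideConst(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = (float)i / 256.0f;             // compiler folds this into a multiply by 1/256
}

__global__ void divideIntrinsic(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = __fdividef((float)i, 256.0f);  // forces an actual division
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // warm-up launch so context creation does not skew the first timing
    divideConst<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    cudaEventRecord(start);
    divideConst<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("division by constant: %f ms\n", ms);

    cudaEventRecord(start);
    divideIntrinsic<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("__fdividef          : %f ms\n", ms);

    cudaFree(d);
    return 0;
}

You can also compile with "nvcc -ptx" and inspect the output: in the constant-divisor version you should find a mul.f32 where the division used to be.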