__fdividef(x,y) and __logf(x) faster than floating point add.


I found that __fdividef(x,y) and __logf(x) are faster than floating-point add and multiply operations.
I am compiling my kernel with the -arch=sm_13 option and am using a GTX 280 card.
More specifically, add and multiply take 8 instructions to execute, while __fdividef and __logf take 4 and 5 instructions respectively. These numbers are from cudaProf.
The Programming Guide says the throughput of add/multiply is 8 operations per clock cycle, while that of __fdividef and __logf is 1.6 and 2 operations per clock cycle respectively.

Can anyone please explain this weird behavior?

Are you referring to double-precision add and multiply?
If not, check that you are not using double-precision literals in your code (like 3.14 instead of 3.14f).
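
To illustrate the literal issue, here is a minimal sketch. The helper names and the HD macro are made up for this example and are not from the original kernel. It relies on the standard C++ rule, which CUDA C follows, that an unsuffixed literal such as 3.14 has type double, so a float operand is promoted and the arithmetic is done in double precision. My understanding is that with -arch=sm_13 the compiler keeps that double arithmetic instead of demoting it to float, which would make the add or multiply look more expensive than expected.

```cpp
#include <type_traits>

// HD is a hypothetical convenience macro for this sketch: it lets the same
// code build as a .cu file (nvcc) or as plain host C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// 3.14 is a double literal: x is promoted to double, the multiply is done
// in double precision, and the result is narrowed back to float on return.
HD inline float scale_with_double_literal(float x) {
    return x * 3.14;
}

// 3.14f is a float literal: the multiply stays in single precision.
HD inline float scale_with_float_literal(float x) {
    return x * 3.14f;
}

// Compile-time checks of the expression types described above.
static_assert(std::is_same<decltype(1.0f * 3.14), double>::value,
              "a double literal promotes the expression to double");
static_assert(std::is_same<decltype(1.0f * 3.14f), float>::value,
              "a float literal keeps the expression in float");
```

If the kernel contains expressions like the first helper, adding the f suffix to every literal and re-running cudaProf should show whether the instruction counts for add and multiply change.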