Hi,
I found that __fdividef(x,y) and __logf(x) appear to be faster than floating-point add and multiply operations.
I am compiling my kernel with the -arch=sm_13 option and running it on a GTX 280 card.
More specifically, add and multiply each take 8 instructions to execute, while __fdividef and __logf take 4 and 5 instructions respectively. These numbers are from cudaProf.
The Programming Guide says the throughput of add/multiply is 8 operations per clock cycle, while that of __fdividef and __logf is 1.6 and 2 operations per clock cycle respectively, so I expected add and multiply to be the faster ones.
Can anyone please explain this behavior?
Thanks.