Hi,

I have been porting some CUDA code to OpenCL to look at the performance tradeoff and have come across some odd results. Our code can do its calculations in two ways. They are basically the same, except that at one point one version calls a function that does some simple multiplications, and the other calls a different function that uses the log function to calculate its return value.

In the first case, with the simple multiplications, the CUDA and OpenCL versions are within 10% of each other in performance.

In the second case, where the log function is used, the gap grows to almost 30%.

I am using toolkit 3.1 and driver 256.40 on a GTX285 for both codes.

If I look at the PTX for both cases, I can see that the log function is clearly implemented differently: the CUDA version has some mad instructions and the OpenCL version has some fma instructions. Beyond that, the PTX is currently a bit beyond me.

Does anyone know why the OpenCL implementation of log should be so much slower than the CUDA version, or even why they should differ at all? Is it just that CUDA optimises for the hardware better than OpenCL does? I expected the opposite, since the OpenCL code is compiled once the device is known.

–

jason