Performance of Double Precision log function

Hi,
I have been porting some CUDA code to OpenCL to look at the performance tradeoff and have
come across some odd results. We have a code that can do its calculations in 2 ways. They are
basically the same, except that at one point one version calls a function that does some simple
multiplications, and the other calls a different function that uses the log function to
calculate its return value.

In the first case, with the simple multiplication, the CUDA and OpenCL versions differ in performance by less than 10%.
In the second case, where the log function is used, they differ by almost 30%.

I am using toolkit 3.1 with driver 256.40 on a GTX285 for both codes.

If I look at the PTX for both cases I can see that the log function is clearly being computed differently:
the CUDA version has some .MAD instructions and the OpenCL version has some .FMA instructions, but the PTX is
currently a bit beyond me.

Does anyone know why the OpenCL implementation of log should be so much worse than the CUDA version? Or even
why they should be different? Is it just CUDA optimising for the hardware better than OpenCL? I thought
the opposite should be true, since the OpenCL code is compiled once the device is known.


jason

I don’t really know, but it is possible that the OpenCL specification enforces higher precision requirements on the results. Try replacing the call with ‘native_log’ and see if it makes any difference.

Unfortunately native_log only works for float, not double.

jason
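
For anyone tempted to cast to float just to be able to call native_log, here is a rough host-side Python sketch of how much precision that would throw away. It is only an illustration under stated assumptions: the helper names are made up, and it rounds a double-precision log to single precision, which is the best case for any float-only log. The accuracy of native_log itself is implementation-defined, so the real error may be larger.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def float_log_rel_error(x):
    """Relative error of a best-case float log against the double result.

    Hypothetical helper for illustration: both the input and the result are
    rounded to float32, mimicking a cast to float around a float-only log.
    x must be positive and not equal to 1 (log(1) == 0 would divide by zero).
    """
    exact = math.log(x)
    approx = to_float32(math.log(to_float32(x)))
    return abs(approx - exact) / abs(exact)


if __name__ == "__main__":
    for x in (1.5, 10.0, 12345.678):
        print(f"x={x}: relative error {float_log_rel_error(x):.3e}")
```

The errors come out many orders of magnitude above double-precision rounding error (about 1e-16), so the float detour is only an option if the surrounding calculation can tolerate single-precision results.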