Hi,
I have been porting some CUDA code to OpenCL to look at the performance tradeoff and have
come across some odd results. Our code can do its calculations in two ways. They are
basically the same, except that at one point one version calls a function that does some simple
multiplications, and the other calls a different function that uses the log function to
calculate its return value.
In the first case, with the simple multiplication, the CUDA and OpenCL codes differ by less than 10% in performance.
In the second case, where the log function is used, they differ by almost 30%.
I am using toolkit 3.1 with driver 256.40 on a GTX285 for both codes.
If I look at the PTX for both cases, I can see that the log function is clearly being done differently:
the CUDA version has some .MAD instructions and the OpenCL version has some .FMA instructions, but the PTX is
currently a bit beyond me.
Does anyone know why the OpenCL implementation of log should be so much worse than the CUDA version? Or even
why they should be different? Is it just CUDA optimising for the hardware better than OpenCL? I thought
the opposite should be true, since the OpenCL code is compiled once the device is known.
I don’t really know, but it is possible that the OpenCL specification imposes higher precision requirements on the result of log. Try replacing the call with ‘native_log’, whose accuracy is implementation-defined, and see if it makes any difference.