Performance of Double Precision log function

jasno · November 26, 2010, 1:33pm

Hi,
I have been porting some CUDA code to OpenCL to look at the performance tradeoff and have
come across some odd results. Our code can do its calculations in two ways. They are
basically the same, except that at one point one version calls a function that does some simple
multiplications, and the other calls a different function that uses the log function to
calculate its return value.

In the first case, with the simple multiplication, the CUDA and OpenCL versions differ by less than 10% in performance.
In the second case, where the log function is used, they differ by almost 30%.

I am using toolkit 3.1 and driver 256.40 on a GTX285 for both codes.

If I look at the PTX for both cases I can see that it is clearly doing the log function differently:
the CUDA version has some .MAD instructions and the OpenCL version has some .FMA instructions, but the PTX is
currently a bit beyond me.

Does anyone know why the OpenCL implementation of log should be so much worse than the CUDA version? Or even
why they should be different? Is it just CUDA optimising for the hardware better than OpenCL? I thought
the opposite should be true, since the OpenCL code is compiled once the device is known.

–
jason

Martin_Nilsson · November 29, 2010, 12:09pm

I don’t really know, but it is possible that the OpenCL specification enforces higher precision requirements on the results. Try to replace the call with ‘native_log’ and see if it makes any difference.
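
One way to probe the precision hypothesis would be to copy the log results from both versions back to the host and compare them in units in the last place (ULPs). The helper below is a rough sketch under that assumption. `ulp_distance` and `ordered_bits` are hypothetical names, not part of CUDA or OpenCL, and the code assumes IEEE 754 doubles and finite inputs.

```c
#include <stdint.h>
#include <string.h>

/* Map a double's bit pattern to an integer that increases monotonically
   with the value, so that adjacent doubles differ by exactly 1. */
static int64_t ordered_bits(double x) {
    int64_t bits;
    memcpy(&bits, &x, sizeof bits);          /* well-defined type pun */
    return bits < 0 ? INT64_MIN - bits : bits;
}

/* Number of representable doubles between a and b.
   Assumes both are finite; NaN and infinity are not handled. */
uint64_t ulp_distance(double a, double b) {
    int64_t ia = ordered_bits(a);
    int64_t ib = ordered_bits(b);
    return ia >= ib ? (uint64_t)ia - (uint64_t)ib
                    : (uint64_t)ib - (uint64_t)ia;
}
```

If the two versions' outputs are consistently within a few ULPs of each other but not identical, that would be consistent with the two runtimes using different log implementations with different accuracy targets.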

jasno · November 29, 2010, 12:19pm

Unfortunately native_log only works for float, not double.

–

jason
