OpenCL vs Cuda performance on same kernels

Hello,

I have recently tried to port the simulation program I am working on from Cuda to OpenCL, and it became about ten times slower. The kernels do not have any Cuda-specific optimizations, just a bunch of elementwise additions, multiplications and trigonometric functions. I checked the performance of the trigonometric functions alone, and it seems to be fine. All the testing is performed on Ubuntu 9.10 x64, Cuda 3.0, nv drivers 195.17.

Another example is my Python FFT library (http://pypi.python.org/pypi/pyfft), which can work with both Cuda and OpenCL, and whose code originally came from Apple’s OpenCL FFT implementation. Despite this, its Cuda version works faster (2 to 10 times, depending on problem size) - if you have python, pyopencl and pycuda installed on your system, you can check it yourself.

So, has anyone recently compared performance for the same kernels on OpenCL and Cuda? Can I just blame the OpenCL driver, which is not as mature as the Cuda driver, or do I need to look harder for bottlenecks in my code? Maybe someone has some tips on common issues that appear after such a port (the Cuda and OpenCL APIs look very similar, which may hide some internal discrepancies between them).

I understand that my question sounds rather vague, because I just cannot put my finger on the problem and say “this thing is the bottleneck”. Parts of the system seem to work fast, yet the whole system does not. Anyway, thanks in advance; I will appreciate any advice.

Did you use ‘-cl-mad-enable’? Would you be able to post your kernel code?

Can you decompose the execution time of your program?

For me, since CUDA 3.0, I have a problem in my OpenCL FFT: the compilation time of the kernel code is 10 (or more) times longer than before, and I don’t know why…
So it’s not a computing time problem.

I use ‘-cl-mad-enable’, but when I drop it, it’s the same.

With ‘-cl-mad-enable’ or without it, the results are approximately the same (compared to the OpenCL-Cuda difference, at least).

As for the kernels - here’s an example (it is a template, hence all the ${…}):

${p.complex.name} complex_mul_scalar(${p.complex.name} a, ${p.scalar.name} b)
{
	return ${p.complex.ctr}(a.x * b, a.y * b);
}

${p.complex.name} complex_mul(${p.complex.name} a, ${p.complex.name} b)
{
	return ${p.complex.ctr}(mad(-a.y, b.y, a.x * b.x), mad(a.y, b.x, a.x * b.y));
}

${p.scalar.name} squared_abs(${p.complex.name} a)
{
	return a.x * a.x + a.y * a.y;
}

${p.complex.name} cexp(${p.complex.name} a)
{
	${p.scalar.name} module = exp(a.x);
	${p.scalar.name} angle = a.y;
	return ${p.complex.ctr}(module * native_cos(angle), module * native_sin(angle));
}

float get_float_from_image(read_only image3d_t image, int i, int j, int k)
{
	sampler_t sampler = CLK_FILTER_NEAREST | CLK_ADDRESS_CLAMP |
		CLK_NORMALIZED_COORDS_FALSE;
	uint4 image_data = read_imageui(image, sampler, (int4)(i, j, k, 0));
	return *((float*)&image_data);
}

#define DEFINE_INDEXES int i = get_global_id(0), j = get_global_id(1), k = get_global_id(2), index = (k << ${c.nvx_pow + c.nvy_pow}) + (j << ${c.nvx_pow}) + i

// Propagates state vector in x-space for evolution calculation
__kernel void propagateXSpaceTwoComponent(__global ${p.complex.name} *aa,
	__global ${p.complex.name} *bb, ${p.scalar.name} dt,
	read_only image3d_t potentials)
{
	DEFINE_INDEXES;
	${p.scalar.name} V = get_float_from_image(potentials, i, j, k % ${c.nvz});
	${p.complex.name} a = aa[index];
	${p.complex.name} b = bb[index];

	// store initial x-space field
	${p.complex.name} a0 = a;
	${p.complex.name} b0 = b;

	${p.complex.name} pa, pb, da = ${p.complex.ctr}(0, 0), db = ${p.complex.ctr}(0, 0);
	${p.scalar.name} n_a, n_b;

	// iterate to midpoint solution
	%for iter in range(c.itmax):
		n_a = squared_abs(a);
		n_b = squared_abs(b);

		// TODO: there must be no minus sign before imaginary part,
		// but without it the whole thing diverges
		pa = ${p.complex.ctr}(
			-(${c.l111} * n_a * n_a + ${c.l12} * n_b) / 2,
			-(-V - ${c.g11} * n_a - ${c.g12} * n_b));
		pb = ${p.complex.ctr}(
			-(${c.l22} * n_b + ${c.l12} * n_a) / 2,
			-(-V - ${c.g22} * n_b - ${c.g12} * n_a + ${c.detuning}));

		// calculate midpoint log derivative and exponentiate
		da = cexp(complex_mul_scalar(pa, (dt / 2)));
		db = cexp(complex_mul_scalar(pb, (dt / 2)));

		// propagate to midpoint using log derivative
		a = complex_mul(a0, da);
		b = complex_mul(b0, db);
	%endfor

	// propagate to endpoint using log derivative
	aa[index] = complex_mul(a, da);
	bb[index] = complex_mul(b, db);
}

It is part of this program, in the file evolution.py. It was ported to OpenCL a few days ago. As you can see, no optimization magic at all, just an elementwise kernel. I tried to get rid of reading from the image (used an in-place calculation) - it did not change the performance.
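For anyone following along, here is what those complex helpers compute, sketched in plain Python. The mapping of ${p.complex.name} to Python’s built-in complex type is my assumption (.x as the real part, .y as the imaginary part); the function names mirror the kernel:

```python
import math

def complex_mul(a: complex, b: complex) -> complex:
    # The kernel's mad(-a.y, b.y, a.x * b.x) + i * mad(a.y, b.x, a.x * b.y)
    # is just the ordinary complex product, spelled with multiply-adds.
    return complex(a.real * b.real - a.imag * b.imag,
                   a.imag * b.real + a.real * b.imag)

def squared_abs(a: complex) -> float:
    # |a|^2 without the square root
    return a.real * a.real + a.imag * a.imag

def cexp(a: complex) -> complex:
    # exp(a.x) sets the modulus, a.y the phase; the kernel uses
    # native_cos/native_sin, which trade some accuracy for speed.
    module = math.exp(a.real)
    return complex(module * math.cos(a.imag), module * math.sin(a.imag))
```

The mad() form in the kernel exists only so the compiler can fuse the multiply-adds (which is what ‘-cl-mad-enable’ permits); the mathematical result is the plain complex product.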

The other example is my pyfft module, which has performance tests for both OpenCL and Cuda and uses the same kernels for both of them. They are pretty big, and located in the file pyfft/kernel.mako (it is a template too).

And, according to the profiler, Python overhead does not seem to play any significant role. PyCuda and PyOpenCL are very thin and fast wrappers around the corresponding APIs.

I am sure I am measuring the time of the execution. But I agree that the compilation is pretty slow too.

Yep, my current hypothesis is that there is some driver overhead. I think I have already ruled out any possible computation bottlenecks. I’ll probably try installing Cuda 2.3 again and see what happens.

After reading this post I decided to explore the difference between CUDA and OpenCL with a simple kernel that just squares every element in an array. Here is what I did and found; whether it means anything or not, well, that’s why I am posting it.

Conditions:

  1. copy 4MB of memory 10,000 times from host to device, then device to host (one complete swap was one cycle; I changed this to 100,000 times, but I’m not patient enough to let it finish)
  2. run the exact same kernel (virtually line for line) 1,000 times.

Results:

  1. OpenCL and CUDA were about the same as far as speed. They alternated being fastest, but were always within 5 ms of each other (total time differed based on load).
  2. OpenCL was significantly SLOWER than CUDA, by at least a factor of 10. The OCL kernel time could be changed dramatically by changing the global and local work sizes of the kernel, but it was nowhere near the speed of CUDA. I can post sample code and results later when I have them with me.

For example, some of the kernel times were 190 ms for 1000 iterated OCL kernels versus about 9-13 ms for the same loop in CUDA. However, if a single kernel was run, then OCL was actually faster, at around 6 ms versus about 9 ms for CUDA.
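A rough back-of-the-envelope on those numbers (the times are the ones quoted in this post, so this is purely illustrative) points at a fixed per-launch overhead rather than slower arithmetic:

```python
# Times quoted above: 190 ms for 1000 OpenCL launches, ~9-13 ms for CUDA
# (taking ~11 ms as a middle value).
ocl_loop_ms, cuda_loop_ms, launches = 190.0, 11.0, 1000

ocl_per_launch_ms = ocl_loop_ms / launches    # ~0.19 ms per kernel launch
cuda_per_launch_ms = cuda_loop_ms / launches  # ~0.011 ms per kernel launch
ratio = ocl_per_launch_ms / cuda_per_launch_ms

print(f"OpenCL: {ocl_per_launch_ms:.3f} ms/launch, "
      f"CUDA: {cuda_per_launch_ms:.3f} ms/launch, ratio ~{ratio:.0f}x")
```

If that reading is right, it fits the queue-management suspicion voiced elsewhere in this thread: tiny kernels are dominated by launch overhead, while the single-launch case (6 ms vs 9 ms) is dominated by one-time setup costs instead.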

This is very interesting :)

edit: I should probably mention this was run on a GTX 275 896MB card.

I suggest trying a newer driver. I’m using 195.62 on my laptop and I’m getting nearly identical speeds for CUDA and OpenCL; OpenCL is a small fraction slower. OpenCL used to be orders of magnitude slower on the same kernel, but it sped up once I updated the drivers and moved to the 3.0 beta.

I installed 195.36.15 (released March 19, 2010, the most recent one for Linux x64), and the only thing I got is that it now takes OpenCL several minutes to compile each of my kernels, instead of several seconds with the previous driver version. Performance remains the same.

I still think it’s not the kernels that are slow; rather, something is wrong with queue management, either in the driver or in my program (all of my kernels must be executed successively, so I just push all of them to the queue and wait for them to finish - maybe it’s not the best way).

I agree with that. There was no problem with the CUDA 2.3 version.

Maybe look at the two functions clEnqueueReadBuffer and clEnqueueWriteBuffer.

In SDK 2.3 the third argument (which makes the call blocking) had no impact, and the SDK examples always passed CL_TRUE.

But since 3.0 the third argument does have an impact, and in the new SDK examples you will see CL_FALSE.

Otherwise, I have seen a queue problem between 2.3 and 3.0, and the problem still exists (an example with several kernels plus ReadBuffer and WriteBuffer). To put this problem in evidence, I made a very simple kernel with a huge loop in it (I launch this kernel 1000 times, with a profiler on the queue).

Thanks

J

Oh yes, I worked on the same FFT kernel ported from Apple’s codebase as well. The best performance I got (after tuning the kernel parameters for a while) for batched 1D FFTs of size 512/1024/2048 is around 100 GFLOPS (on-device, excluding memory transfers), while the corresponding CUDA version has claimed over 300 GFLOPS. BTW, the card I am talking about is a GTX 280, and the GFLOPS figure is measured as 5*N*log2(N)/T.
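For reference, that metric sketched in Python (5*N*log2(N) is the conventional flop count for a complex radix-2 FFT of size N; the function name is mine):

```python
import math

def fft_gflops(n: int, seconds_per_transform: float) -> float:
    # 5 * N * log2(N) real floating-point operations per transform,
    # divided by the time one transform takes, in units of 1e9 flop/s.
    return 5.0 * n * math.log2(n) / seconds_per_transform / 1e9

# e.g. 100 GFLOPS on a 1024-point FFT means roughly 0.5 us per transform:
# 5 * 1024 * 10 = 51200 flops, and 51200 / 100e9 = 0.512 us.
```

Note this counts the textbook radix-2 operations regardless of what the kernel actually executes, so it is a throughput yardstick rather than a measure of hardware flops.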

I have the same problem as you. My OpenCL device is an Nvidia GPU and the OS is Windows. I think the macros fftKernel8 and complexMul cause the problem. When I delete the macros from the code, it compiles fast. And when I compile a program of the same length as this one without macros, it also compiles fast. So I think the macros cause the problem. I don’t know whether other OpenCL devices also compile so slowly when there are macros.
