Experiences and some problems porting CUDA code to OpenCL

twofisher · April 20, 2012, 9:56am

Hello!

In the past I developed some CUDA applications, which turned out to be quite successful.

Since AMD devices are reported to deliver more speed per buck, I tried to get my hands on such a device. So last week I began porting parts of my CUDA code to OpenCL.

Here are my experiences / problems:

Coding for CUDA is much more fun and more straightforward than coding for OpenCL. In OpenCL I first had to develop a library for all the necessary setup work before writing the first line of kernel code.

I used to put constant values into constant device memory, so I tried to do that in the ported code, too. Unfortunately, I hit some problems:

My code uses two kernels, k1 and k2. The fast kernel k1 produces some intermediate values from the input data. For the second (long-running) kernel k2, this intermediate data (pre-computed arrays) is constant. So the CUDA code uses cudaMemcpyToSymbol between the two calls to write the produced values into the constant arrays on the device. Unfortunately, I have no idea how to do that in OpenCL.

The only solution I came up with is to generate OpenCL code dynamically and compile it before running k2. That is no real solution, since k1 and k2 run inside a large loop with some 100,000 iterations.

Thus I dropped the idea of using constant memory for that, and the code now uses global memory.
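One option that might avoid the recompilation: OpenCL C accepts `__constant` pointers as kernel arguments, so the pre-computed array can live in an ordinary buffer object that k1 writes as `__global` and k2 reads as `__constant`. The following is only a minimal sketch under assumptions: `TABLE_SIZE`, the kernel bodies and the `shim_gid` helper are made up for illustration, and the shim exists solely so that the kernel source also compiles as plain C without an OpenCL SDK. The host-side calls are real OpenCL API calls, but they are shown as comments because they need a device to run.

```c
/* Shim for plain C only; an OpenCL compiler defines __OPENCL_VERSION__
   and skips this part. shim_gid is a hypothetical stand-in for the
   work-item index. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
#define __constant const
static size_t shim_gid;
static size_t get_global_id(unsigned int dim) { (void)dim; return shim_gid; }
#endif

#define TABLE_SIZE 4   /* assumed size of the pre-computed array */

/* k1 (fast): fills the table; it sees the buffer as global memory. */
__kernel void k1(__global const float *input, __global float *table)
{
    size_t i = get_global_id(0);
    table[i] = 2.0f * input[i];          /* placeholder computation */
}

/* k2 (long running): reads the same buffer through a __constant pointer. */
__kernel void k2(__constant float *table, __global float *out)
{
    size_t i = get_global_id(0);
    float acc = 0.0f;
    for (int j = 0; j < TABLE_SIZE; ++j)
        acc += table[j];
    out[i] = acc + (float)i;             /* placeholder computation */
}

/* Host side, once before the loop (ctx, err, table_buf are assumed names):
 *   cl_mem table_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
 *                                     TABLE_SIZE * sizeof(float), NULL, &err);
 *   clSetKernelArg(k1, 1, sizeof(cl_mem), &table_buf);
 *   clSetKernelArg(k2, 0, sizeof(cl_mem), &table_buf);
 * Then, per iteration, enqueue k1 followed by k2 on the same in-order queue.
 */
```

The table has to fit within the device's CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, which can be queried with clGetDeviceInfo. Whether the `__constant` path is actually faster than plain global memory depends on the device, so it is worth measuring both.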

The second problem is that I really dislike the following aspect of the OpenCL design:

Each pointer in OpenCL code must be declared with the type of memory (address space) it points to.

My code has a large function f (about 1000 lines of heavily unrolled code), which works on a small array.

Different parts of the code call that function for different inputs:

result = f(ptr_to_global_array),

result = f(ptr_to_local_memory) and

result = f(ptr_to_constant_memory).

Am I really supposed to duplicate f’s code, with the only change being the declaration of the memory type it works on? I really hate the idea of converting the code into a macro…

This really should be the compiler’s task. If it decides not to inline the function, then it should generate the three different versions of the function’s code itself!

Since there are no function pointers in OpenCL and the complete code has to be self-contained (i.e. no libs), it should be easy for the compiler to determine which type of memory the function uses in each part of the code. The CUDA compiler manages that, too, at least in all obvious cases!
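For reference, the macro workaround mentioned above would look roughly like this. It is a minimal sketch, not a recommendation: `DEFINE_F` is a hypothetical helper name, and the four-element sum merely stands in for the real body of f. The `#ifndef` block defines the address-space qualifiers away so that the same source can be checked with an ordinary C compiler.

```c
/* For plain C only; an OpenCL compiler defines __OPENCL_VERSION__
   and keeps the real address-space qualifiers. */
#ifndef __OPENCL_VERSION__
#define __global
#define __local
#define __constant
#endif

/* DEFINE_F (hypothetical helper) stamps out one copy of the body per
   address space. The four-element sum is a placeholder for f. */
#define DEFINE_F(NAME, SPACE)                  \
    int NAME(SPACE const int *a)               \
    {                                          \
        int result = 0;                        \
        for (int i = 0; i < 4; ++i)            \
            result += a[i];                    \
        return result;                         \
    }

DEFINE_F(f_global,   __global)
DEFINE_F(f_local,    __local)
DEFINE_F(f_constant, __constant)
```

Each call site then picks the variant by name, e.g. result = f_global(ptr_to_global_array). For a body of about 1000 lines, the backslash continuations get unwieldy; an alternative that should also work, as far as I know, is to keep the body in its own file and #include it three times with a different qualifier macro defined each time.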

As far as I can see, the entry points of the device code (i.e. the kernels) can only take pointers to global memory anyway. Or is there a way to write a kernel that takes pointer arguments to local (= shared), constant or private memory?

Some things I need to understand better before I can trust my code:

- What’s the difference between clFinish(queue) and clFlush(queue)?

- If I enqueue a non-blocking clEnqueueWriteBuffer and later enqueue a kernel in the same queue (also non-blocking), does OpenCL guarantee that the buffer has been filled completely before the kernel starts?

- Do I have to call clReleaseEvent for evt at the end of the following code?

cl_event evt;

call kernel(...., &evt);

query some info from evt (especially profiling info)

// do I need clReleaseEvent for evt here?

What I really like better in OpenCL than in CUDA is the handling of events for profiling. It seems more straightforward to me.

The resulting first OpenCL version of my code performs a bit worse than the CUDA version, but all in all it seems to work fine, although coding in CUDA is a lot more fun.

Is there any project implementing CUDA for AMD devices? (OK, this might be the wrong question for this forum.)

Thanks for any hints in advance!