I’ve implemented multiple algorithms for CUDA and ported them to OpenCL without having to change much of the code. I’ve done benchmarks in a bigger context, including the time for transferring the data to and from the GPU, and CUDA performed better than, or at least as well as, OpenCL.
Now I just want to benchmark the kernel execution times for both. The way I did it was to use gettimeofday right before and after the kernel invocation. Since kernel calls are nonblocking, I added a cudaThreadSynchronize() or clFinish() call, respectively, after the invocation to ensure that the kernel had finished. With this setup, the results favor OpenCL across the board. Each time, the competing kernels are called just once. I have no idea what causes this disparity. Some things I could think of:
- clFinish and cudaThreadSynchronize don’t actually perform equivalent operations
- OpenCL is in fact more efficient, and the memory transfers in my framework are simply the bottleneck in the bigger benchmark. After all, the OpenCL compiler emits PTX 2.2 while CUDA uses PTX 1.4.
- A CUDA kernel call includes some hidden transfers (like small shared memory blocks declared in the .cu file) which in OpenCL I perform manually before the kernel call.
Benchmarks were done on Linux with CUDA 3.2, OpenCL 1.0, and a GeForce 8600 GT. I hope I’ve provided enough information and am not missing something fundamental.