Benchmark kernel execution time with CUDA and OpenCL: how to ensure that identical kernels are benchmarked fairly?

Hi forum,

I’ve implemented multiple algorithms for CUDA and ported them to OpenCL without having to change much of the code. I’ve done benchmarks in a bigger context, including the time for transferring data to and from the GPU, and CUDA performed better than, or at least as well as, OpenCL.

Now I just want to benchmark the kernel execution times for both. The way I did it was to call gettimeofday right before and after the kernel invocations. Since kernel calls are non-blocking, I added a cudaThreadSynchronize() or clFinish() call after the invocations to ensure that the kernel had finished. Now the results favor OpenCL across the board. Each time, the competing kernels are called just once. I have no idea what causes this disparity. Some possibilities I can think of:
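For reference, here is a minimal sketch of the host-timer approach described above, on the CUDA side. The kernel (`dummyKernel`) and its launch configuration are placeholders standing in for the actual benchmarked kernel; `cudaThreadSynchronize()` is used as in the post (it was the current API in CUDA 3.2):

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// Trivial stand-in kernel; replace with the kernel under test.
__global__ void dummyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Wall-clock time in seconds via gettimeofday.
static double nowSec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    double t0 = nowSec();
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);  // launch is asynchronous
    cudaThreadSynchronize();  // block until the kernel has finished
    double t1 = nowSec();

    printf("kernel time: %.3f ms\n", (t1 - t0) * 1e3);
    cudaFree(d);
    return 0;
}
```

The OpenCL version would look the same, with `clFinish(queue)` in place of `cudaThreadSynchronize()`.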

  • clFinish and cudaThreadSynchronize don’t perform the same operation
  • OpenCL is in fact more efficient, and the memory transfers in my framework are the bottleneck. OpenCL generates PTX 2.2 while CUDA uses PTX 1.4, after all.
  • A CUDA kernel call includes some hidden transfers (like small shared-memory blocks declared in the .cu file) which in OpenCL are done manually before the kernel call.

Benchmarks were done on Linux, CUDA 3.2, OpenCL 1.0, GeForce 8600 GT. Hope I provided enough information and that I’m not missing something fundamental.

Are you seeing huge differences in performance?

  • It’s always a good idea to do a warmup loop before the timing code; I hope you did that.

  • In place of gettimeofday, to really see what’s happening, you can use CUDA events and query the OpenCL command queue’s event profiling info for timing purposes.
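A sketch of the CUDA-events approach, including a warmup launch as suggested above (the kernel is again a hypothetical stand-in):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel for the one being benchmarked.
__global__ void dummyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // Warmup launch so one-time initialization costs are excluded.
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaThreadSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);               // record on the default stream
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              // wait until the stop event completes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // GPU-side elapsed time in ms
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

On the OpenCL side, the equivalent is to create the command queue with `CL_QUEUE_PROFILING_ENABLE` and read `CL_PROFILING_COMMAND_START` / `CL_PROFILING_COMMAND_END` from the kernel’s event via `clGetEventProfilingInfo`. Both approaches time on the device itself, so they sidestep any question of what the host-side synchronization calls cost.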

Do you also have a [font=“Courier New”]cudaThreadSynchronize()[/font] immediately before the first [font=“Courier New”]gettimeofday()[/font] call? Up to the last 64 kB of a host->device transfer can complete asynchronously even with the default synchronous cudaMemcpy functions.
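In other words, the timing sequence needs a synchronize on both sides of the timed region, not just after it. A minimal sketch (stand-in kernel, 16 kB transfer chosen to fall under that threshold):

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// Stand-in kernel for the one being benchmarked.
__global__ void dummyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

static double nowSec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    const int n = 4096;                // 16 kB: small enough to be staged asynchronously
    float h[4096] = {0};
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaThreadSynchronize();           // make sure the copy has really finished
    double t0 = nowSec();
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaThreadSynchronize();           // wait for the kernel itself
    double t1 = nowSec();

    printf("kernel time: %.3f ms\n", (t1 - t0) * 1e3);
    cudaFree(d);
    return 0;
}
```

Without the first synchronize, the pending copy would be charged to the kernel's time.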