OpenCL performs better than CUDA

I have implemented the MOPSO algorithm in CUDA and am now implementing it in OpenCL. I am getting better execution times when I run my program with OpenCL. I don't understand why this happens, as I am executing the code on the same GPU (Quadro FX 3700).

How similar are your implementations? Are they really as similar as possible? Have you checked the compute profiler output?

We ran multiple experiments comparing C4CUDA and OpenCL performance and hardly noticed any difference in most cases. This changes if you make use of a specific feature, like the more generic texture concept in C for CUDA. Quite strangely, C4CUDA seems to require one more register per thread than OpenCL. Because the registers of a multiprocessor are shared among all resident threads, in some cases this can lead to better utilization for OpenCL (an additional block/work-group fits on a multiprocessor/compute unit) and a significant advantage for OpenCL.
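To illustrate how a single register can matter, here is a rough sketch of the arithmetic in Python. The limits used (8192 registers, 768 resident threads and 8 resident blocks per multiprocessor) are assumptions taken from the usual figures for compute capability 1.1 devices; the block size of 256 and the register counts 16 and 17 are hypothetical and not measured from any real kernel. The function name is made up for this example, and the calculation ignores register allocation granularity and shared memory limits.

```python
def blocks_per_multiprocessor(regs_per_thread, threads_per_block,
                              regs_per_mp=8192,        # assumed: compute capability 1.1
                              max_threads_per_mp=768,  # assumed: compute capability 1.1
                              max_blocks_per_mp=8):    # assumed: compute capability 1.1
    """Estimate how many blocks/work-groups can be resident on one multiprocessor.

    Simplified: ignores allocation granularity and shared memory usage.
    """
    # Limit imposed by the register file shared among all resident threads.
    by_registers = regs_per_mp // (regs_per_thread * threads_per_block)
    # Limit imposed by the maximum number of resident threads.
    by_threads = max_threads_per_mp // threads_per_block
    # The tightest limit wins.
    return min(by_registers, by_threads, max_blocks_per_mp)


# Hypothetical kernels that differ by a single register per thread.
for regs in (16, 17):
    print(regs, "registers/thread ->",
          blocks_per_multiprocessor(regs, 256), "resident block(s)")
```

With these assumed numbers, 16 registers per thread lets two blocks of 256 threads share a multiprocessor, while 17 registers leaves room for only one, so the kernel with the extra register has half as many threads available to hide memory latency.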

Thank you, that might help. Yes, the implementations are the same. What is the compute profiler output? How do I check it, and what can be inferred from it?

The compute profiler is part of the CUDA SDK and can be used to get insight into how OpenCL and C4CUDA code performs. (I think this has been the case since SDK 3.1; up to 3.0 there were separate programs for profiling C4CUDA and OpenCL, called CUDA Visual Profiler and OpenCL Visual Profiler.)

If you are using Linux (or Mac OS, where it hardly works), you can find it at /usr/local/cuda/computeprof/bin/computeprof. Unfortunately, you will have to add /usr/local/cuda/computeprof/bin to LD_LIBRARY_PATH manually: it will most likely crash if it picks up your distribution's Qt libraries instead of the bundled ones. On Windows, you should find it under All Programs / NVIDIA Corporation / CUDA Toolkit.
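For the Linux case, a small launcher along these lines should do. It is only a sketch: the path is the one mentioned above and may differ on your installation, and the helper names env_with_bundled_libs and launch_profiler are made up for this example. It prepends the profiler directory to LD_LIBRARY_PATH so that the bundled libraries are searched first.

```python
import os
import subprocess

# Directory named in the post above; adjust it if CUDA is installed elsewhere.
COMPUTEPROF_DIR = "/usr/local/cuda/computeprof/bin"


def env_with_bundled_libs(base_env, lib_dir=COMPUTEPROF_DIR):
    """Return a copy of base_env with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend so the bundled libraries are found before the distribution's.
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    return env


def launch_profiler():
    """Start computeprof with the adjusted environment (hypothetical helper)."""
    binary = os.path.join(COMPUTEPROF_DIR, "computeprof")
    return subprocess.call([binary], env=env_with_bundled_libs(os.environ))
```

Calling launch_profiler() starts the profiler; your shell's own LD_LIBRARY_PATH is left untouched, since only the child process sees the modified environment.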

Where can we find documentation on the “extra register”? I would like to read more about it.