OpenCL performs better than CUDA

I have implemented the MOPSO algorithm in CUDA and am now implementing it in OpenCL… I am getting better execution times when I run my program with OpenCL. I don't understand why this happens, as I am executing the code on the same GPU (Quadro FX 3700)…

I’d recommend taking a look at the PTX generated by both CUDA and OpenCL and comparing it … but maybe the cause is rather to be found in your host code. Maybe you’re using the OpenCL API in a way that lets it overlap copy and compute, or you’re using page-locked memory, or something similar that the CUDA program is probably not doing.
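
To make the host-side suggestion concrete, here is a minimal sketch of what page-locked memory plus copy/compute overlap can look like on the CUDA side. It is not your MOPSO code: the `scale` kernel, the `run_overlapped` helper, the chunk count and the block size of 256 are all placeholders I made up for illustration. Whether the copies really overlap with the kernels also depends on the device and driver, so treat it as something to experiment with rather than a guaranteed speedup.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real MOPSO kernels are not shown in the thread.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: splits the buffer into chunks and gives each chunk its
// own stream, so the copy of one chunk may overlap with the kernel of another.
// Returns true if every element came back doubled.
bool run_overlapped(int n, int chunks) {
    float *h = nullptr, *d = nullptr;
    size_t bytes = (size_t)n * sizeof(float);

    // Page-locked (pinned) host memory: needed for cudaMemcpyAsync to be truly asynchronous.
    if (cudaMallocHost((void **)&h, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) { cudaFreeHost(h); return false; }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    std::vector<cudaStream_t> streams(chunks);
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&streams[c]);

    int chunk = (n + chunks - 1) / chunks;
    for (int c = 0; c < chunks; ++c) {
        int offset = c * chunk;
        if (offset >= n) break;
        int len = (n - offset < chunk) ? (n - offset) : chunk;
        size_t len_bytes = (size_t)len * sizeof(float);

        // Upload, compute and download are queued in the same stream, so they
        // run in order for this chunk but independently of the other chunks.
        cudaMemcpyAsync(d + offset, h + offset, len_bytes, cudaMemcpyHostToDevice, streams[c]);
        scale<<<(len + 255) / 256, 256, 0, streams[c]>>>(d + offset, len, 2.0f);
        cudaMemcpyAsync(h + offset, d + offset, len_bytes, cudaMemcpyDeviceToHost, streams[c]);
    }

    // Wait for all streams before touching the host buffer again.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 2.0f);

    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}

int main() {
    bool ok = run_overlapped(1 << 20, 4);
    printf("%s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

If your current CUDA host code uses plain `malloc` and blocking `cudaMemcpy` calls, while the OpenCL version uses pinned buffers or non-blocking enqueue calls, that difference alone could explain the gap. Timing the transfers and the kernels separately in both versions should show where the time goes.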