Performance comparison of CUDA and OpenCL

Hi. I’m seeing something odd when I compare kernel execution times in CUDA and OpenCL. Based on what I’ve read, CUDA should be slightly faster than OpenCL, yet a simple add kernel (adding two 1D arrays) runs faster in OpenCL. The code:

CUDA - Kernel
__global__ void addKernel(int *c, const int *a, const int *b, int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size) {
        c[i] = a[i] + b[i];
    }
}
CUDA - Execution
addKernel <<<62500, 1024 >>>(dev_c, dev_a, dev_b, size);

OpenCL - Kernel
__kernel void Add(__global int *a, __global int *b, __global int *c, int size) {
    int i = get_global_id(0);

    if (i < size) {
        c[i] = a[i] + b[i];
    }
}

OpenCL - Execution
vectorSize=64000000; localWorkSize=1024
error = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &vectorSize, &localWorkSize, 0, NULL, &event);

Array size in both cases is 64000000. Time is measured with the built-in timing functions of CUDA and OpenCL.
Execution times are: CUDA - 8-9 ms, OpenCL - ~5 ms
GPU is GTX970, CUDA ver 7.5, OpenCL 1.2

Other kernels I’ve tested also run slower on CUDA. This is the simplest code that I’ve tested.
Am I doing something wrong, or does the problem lie elsewhere? Does anyone know why I get these results?

Your addKernel invocation doesn’t even match your definition, so I’m pretty sure this is not the code you are running. If you want to provide complete code examples for both cases I will take a look.

Are you compiling a debug project or in debug mode (with -G)? That will slow things down and it’s not how you should make perf comparisons or analysis.
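
For reference, here is a minimal sketch of how I would time this kernel with CUDA events in an optimized build. It is not your actual program: the CHECK macro, the blocksFor helper, the zero-filled inputs, and the warm-up launch are my own assumptions for illustration. Build it without -G (for example, nvcc add.cu -o add) so device optimizations stay enabled.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void addKernel(int *c, const int *a, const int *b, int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical helper: number of blocks needed to cover n elements (rounds up).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int size = 64000000;      // array length from the post
    const int threads = 1024;       // block size from the post
    const size_t bytes = size_t(size) * sizeof(int);

    int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    CHECK(cudaMalloc(&dev_a, bytes));
    CHECK(cudaMalloc(&dev_b, bytes));
    CHECK(cudaMalloc(&dev_c, bytes));
    // Assumption: input values do not matter for timing, so zero-fill them.
    CHECK(cudaMemset(dev_a, 0, bytes));
    CHECK(cudaMemset(dev_b, 0, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    const int blocks = blocksFor(size, threads);

    // Warm-up launch so one-time startup costs are not included in the timing.
    addKernel<<<blocks, threads>>>(dev_c, dev_a, dev_b, size);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Timed launch: events are recorded on the same (default) stream.
    CHECK(cudaEventRecord(start));
    addKernel<<<blocks, threads>>>(dev_c, dev_a, dev_b, size);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaGetLastError());
    CHECK(cudaEventSynchronize(stop));  // wait until the kernel has finished

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("addKernel: %.3f ms\n", ms);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(dev_a));
    CHECK(cudaFree(dev_b));
    CHECK(cudaFree(dev_c));
    return 0;
}
```

With size = 64000000 and 1024 threads per block, blocksFor returns 62500, which matches the launch configuration in your post. Timing a single launch is noisy, so averaging over several launches would give a steadier number for the comparison.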

Sorry, I added an argument to check something while I was writing this post. It’s correct now.

I ran it in Release and now the execution times are very close.
Thanks a lot!