Timing comparisons between OpenCL and CUDA

I have implemented CUDA events and OpenCL events to measure host-to-device (CPU-GPU) copy, device-to-host (GPU-CPU) copy, and kernel execution times. What bugs me the most is that my OpenCL implementation shows better results than my CUDA implementation.

For example (using the event code from the CUDA and OpenCL documentation):

// Note: the command queue must be created with CL_QUEUE_PROFILING_ENABLE,
// otherwise clGetEventProfilingInfo returns invalid timestamps.
cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
float executionTimeInMilliseconds = (end - start) * 1.0e-6f;  // nanoseconds -> milliseconds

cudaEvent_t start, stop;
float time;
cudaEventCreate( &start );   // events must be created before use
cudaEventCreate( &stop );
cudaEventRecord( start, 0 );
kernel<<<grid,threads>>>( d_odata, d_idata, size_x, size_y, NUM_REPS );
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
cudaEventElapsedTime( &time, start, stop );  // result in milliseconds
cudaEventDestroy( start );
cudaEventDestroy( stop );

For 2048 elements I get:
CUDA: CPU-GPU 0.0165979 ms, GPU-CPU 0.091427 ms, kernel 0.007098 ms
OpenCL: CPU-GPU 0.007276 ms, GPU-CPU 0.006684 ms, kernel 0.011754 ms

I tried with bigger element counts like 114440, 2097152, etc., and OpenCL still shows better performance.
The literature and articles all say that CUDA offers better performance, so I'm thinking I'm doing something wrong. What should I check?
I have already checked synchronization, calculated average values, and changed the kernel launch configuration.

Are you using pinned memory in your CUDA implementation? That generally offers better (in some cases twice the) performance in copy operations, because the driver can DMA directly from a page-locked host buffer instead of staging through an intermediate copy.
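As a minimal sketch of what that means in code (the names N, h_data, and d_data are illustrative, not from your program): allocate the host buffer with cudaHostAlloc instead of malloc, and free it with cudaFreeHost. The copy call itself stays the same.

```cuda
// Sketch: pinned (page-locked) host memory for faster CPU<->GPU transfers.
// N, h_data, d_data are placeholder names for this example.
const size_t N = 2048;
float *h_data, *d_data;

// Pinned allocation instead of malloc(): the driver can DMA directly
// from this buffer, so cudaMemcpy avoids an extra staging copy.
cudaHostAlloc(&h_data, N * sizeof(float), cudaHostAllocDefault);
cudaMalloc(&d_data, N * sizeof(float));

// The copy call is unchanged; only the host allocation differs.
cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

// Pinned memory must be released with cudaFreeHost, not free().
cudaFreeHost(h_data);
cudaFree(d_data);
```

Keep in mind pinned memory is a limited resource: page-locking large buffers can degrade overall system performance, so pin only the staging buffers you actually transfer through.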