Same Implementation in CUDA and OpenCL but Different Performance, and OpenCL Faster?

Hi, I have a kernel implemented in both CUDA and OpenCL.
I have made sure the kernel code is the same at the application level: the same grid and thread-block configuration and the same algorithm.
However, I observe that the kernel execution times are quite different: the OpenCL version takes half the time of the CUDA version.
My environment is the CUDA 4.1 toolkit and gcc 4.4, with NVIDIA driver 304.88, on Kubuntu 12.10.
The GPU is a GT430, compute capability 2.1.

I profiled both kernels with the Compute Visual Profiler and found that the global memory throughput differs a lot, but I am stuck, since the applications are the same.

So I played a little with the nvcc compiler options. My first try was
nvcc -keep -I/someincludepaths -gencode=arch=compute_20,code=sm_20
and then
nvcc -keep -I/someincludepaths -gencode=arch=compute_20,code=sm_21
but I see no performance change from varying the "code" option.
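One way to check whether the two "code" options actually produce different machine code is to dump the SASS of each build and compare. A sketch (file names and include paths are assumptions):

```shell
# GT430 is compute capability 2.1, so target sm_21 directly
nvcc -keep -I/someincludepaths -gencode=arch=compute_20,code=sm_21 -c kernel.cu
# Inspect the generated machine code; repeat with code=sm_20 and diff the output
cuobjdump -sass kernel.o
```

If the SASS is identical for both builds, the flag change could not have affected performance, which would explain seeing no difference.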

Given the common assumption that CUDA is faster than OpenCL, my result is contradictory.
My profiling strategy is:

for OpenCL: bind an event to the kernel call, query its start and end times, and take the difference.
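Concretely, the OpenCL timing I describe looks roughly like this (a minimal sketch; it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, and queue, kernel, global_size, and local_size are placeholders):

```c
cl_event evt;
cl_ulong t_start, t_end;

/* Launch the kernel with an event bound to it */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, &evt);
clWaitForEvents(1, &evt);   /* make sure the kernel has finished */

/* Query device-side start/end timestamps (in nanoseconds) */
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(t_end), &t_end, NULL);

printf("kernel time: %.3f ms\n", (t_end - t_start) * 1e-6);
clReleaseEvent(evt);
```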

for CUDA: also using events. Below is a piece of my profiling code:
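(The original snippet is not shown; a minimal sketch of the standard cudaEvent_t timing pattern, where myKernel, grid, block, and args are placeholders:)

```cuda
cudaEvent_t start, stop;
float ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(args);      // placeholder kernel launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);           // wait until the kernel has finished

cudaEventElapsedTime(&ms, start, stop);  // result in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Both timing methods measure device-side execution time, so they should be comparable between the two APIs.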


Can anyone give some clue as to why this happens, or a strategy to trace what is going on?

No replies yet, sad.

Can you provide the code for a simple case that shows the difference? There are some differences between CUDA and OpenCL that require different optimisation tricks to maximise throughput, but without a simple repro case (showing the full host code and both the OpenCL and CUDA kernel source), people will find it very difficult to give any detailed answers.