OpenCL vs CUDA performance

I am comparing the performance of the matrixMul SDK example provided in the CUDA and OpenCL SDKs. The OpenCL version is 5-6x slower after normalizing the matrix sizes. Here’s my configuration: ION, Linux 32-bit, driver 190.29, CUDA toolkit and SDK 2.3, GPU Computing SDK 2.3a.

Is this expected, or am I doing something wrong?

I was at Nvidia’s GPU developer conference, and in several of the OpenCL classes the presenter was asked about the performance difference between CUDA and OpenCL. As you and most everyone can tell, the demos are much slower in OpenCL. The presenter, who I believe was on the development team, was very clear that any performance difference was simply because the team hadn’t had as much time to optimize the OpenCL code, since they have been working on it for less time. He also added that internal development versions of OpenCL were identical in performance to the latest CUDA implementation.

After people asked more questions about the performance, he looked visibly frustrated, repeated the same answer, and said the two are literally identical in performance. Given that the OpenCL spec was written with CUDA specifically in mind, and that the two specifications are very similar, it seems to me there is no reason one should run faster than the other once both are fully optimized; it sounds like they already have an internal version that is just as fast as the CUDA one.

From some simple tests that I ran, I think the issue arises because the global/local IDs and sizes are stored in per-thread global memory instead of registers.

I suppose that in CUDA these IDs and sizes are loaded into registers by the hardware unit that assigns blocks to SMs. Just my speculation.
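If that theory is right, one workaround worth trying is to read each work-item ID once into a private variable at the top of the kernel instead of calling get_global_id() repeatedly inside loops. Here is a minimal, hypothetical matrix-multiply kernel illustrating the idea (this is my own simplified sketch, not the SDK's actual matrixMul code):

```c
// Hypothetical simplified kernel: C = A * B, row-major,
// A is M x K, B is K x N, C is M x N.
__kernel void matmul(__global const float* A,
                     __global const float* B,
                     __global float* C,
                     const int M, const int N, const int K)
{
    // Fetch the work-item IDs once into private variables, which the
    // compiler should keep in registers. If the IDs really live in
    // per-thread global memory, this pays that cost once instead of
    // on every loop iteration.
    const int row = get_global_id(1);
    const int col = get_global_id(0);
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
}
```

It would be interesting to see whether hoisting the IDs like this closes any of the gap versus the CUDA version, where threadIdx/blockIdx are hardware-provided.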