We had some demos at PPAM (gpgpu.org/ppam2009) running the exact same code on a (then still unreleased) AMD GPU and an Intel i7. Sadly, we didn’t include NVIDIA’s toolchain…
The thing about OpenCL is that it is explicitly designed to be low-level; middleware vendors and developers are expected to provide the infrastructure on top. If you write your own BLAS in CL, you will get good scaling across CPU cores, or from CPU to GPU, with out-of-the-box code. You will be slightly slower than vendor-tuned codes. But so what? OpenCL is a great leap forward. Currently, top-of-the-chart CPU performance requires hacking in assembly or SSE intrinsics, and you need to change your “optimal” parameters when switching from an i7 to a Core 2 to a Santa Rosa to whatever the chip is named, just because the L1 and L2 cache sizes differ. Try understanding the build process of GotoBLAS, the best-performing CPU BLAS I am aware of. OpenCL exposes L1 essentially as shared memory, in CUDA speak, for starters… On the CPU we just yell at the compiler folks; on the GPU, we hand-tune ourselves. Feels weird.
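To make that local-memory point concrete, here is a minimal work-group reduction sketch in OpenCL C (the kernel name and the power-of-two work-group size are my assumptions, not code from the demos above). The `__local` buffer is the CL analogue of CUDA’s `__shared__`:

```c
/* Work-group sum reduction -- a sketch, assuming the work-group size is a
   power of two and the global size is a multiple of it. The __local buffer
   is allocated by the host via clSetKernelArg(kernel, 2, local_bytes, NULL). */
__kernel void block_sum(__global const float *in,
                        __global float *partial,   /* one result per work-group */
                        __local  float *tile)      /* CUDA speak: __shared__ */
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    tile[lid] = in[get_global_id(0)];   /* stage into on-chip/local memory */
    barrier(CLK_LOCAL_MEM_FENCE);       /* CUDA speak: __syncthreads() */

    for (size_t s = lsz / 2; s > 0; s >>= 1) {  /* tree reduction in the tile */
        if (lid < s)
            tile[lid] += tile[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        partial[get_group_id(0)] = tile[0];
}
```

On NVIDIA hardware `tile` lives in shared memory; on a CPU implementation the same buffer ends up in cache, which is exactly the L1-as-shared-memory point.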
The real question here is about the future programming model, and about persuading the single-thread folks to pick up parallel computing. In this respect, CUDA essentially is OpenCL: migrating kernels from CUDA to CL is smart copy and paste. Like any open industry standard, CL has its drawbacks, and the extension model will be a mood-killer in practice (I remember incompatibilities between AMD and NVIDIA back in the OpenGL days), but so what?
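To show how mechanical that migration is, here is a hypothetical saxpy kernel as it looks after a CUDA-to-CL port; the comments list the handful of renames involved (the example is mine, not from the post):

```c
/* Hypothetical saxpy kernel after a CUDA -> OpenCL port. The whole
   migration is a handful of renames:
     __global__                            -> __kernel
     blockIdx.x*blockDim.x + threadIdx.x   -> get_global_id(0)
     __shared__                            -> __local
     __syncthreads()                       -> barrier(CLK_LOCAL_MEM_FENCE)   */
__kernel void saxpy(int n, float a,
                    __global const float *x,
                    __global float *y)
{
    int i = get_global_id(0);   /* was: blockIdx.x * blockDim.x + threadIdx.x */
    if (i < n)
        y[i] = a * x[i] + y[i]; /* identical body to the CUDA version */
}
```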
That being said, I am still into CUDA and I’ll continue to be, but vendor lock-in is something to consider. It was fun to do SSE for a while, and it was really cool to realize that the tiling I came up with just needs smart copy and paste to run on GPU and CPU. If you design your algorithms to be reasonably blocked or tile-based, you will be happy on any arch!
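The host side is where the GPU/CPU switch actually happens; a minimal sketch (assuming an OpenCL 1.0 runtime and the first platform only, error handling mostly omitted) that falls back from GPU to CPU without touching a single kernel:

```c
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);

    /* Same kernels, different silicon: swap the device type and the
       identical CL source runs on the host cores instead of the GPU. */
    cl_int err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    if (err != CL_SUCCESS)
        err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

    char name[256];
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("Running on: %s\n", name);
    return 0;
}
```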