CUDA performance vs. OpenCL performance


I just started with GPGPU and I’m trying to decide whether to use CUDA or OpenCL.

I have a ‘GeForce GTX 560’ running alongside an Intel i7 CPU with enough DDR memory.
I downloaded the ‘NVIDIA GPU Computing SDK 4.1’ and ran some of the CUDA samples against the matching OpenCL samples. All of the examples show that OpenCL runs faster. Can someone explain this to me? Why does the same algorithm run faster under OpenCL than under CUDA? Below are some of the examples, taken directly from the NVIDIA SDK:

convolutionSeparable (CUDA) prints:

convolutionSeparable, Throughput = 297.2058 MPixels/sec, Time = 0.03175 s, Size = 9437184 Pixels, NumDevsUsed = 1, Workgroup = 0

oclConvolutionSeparable (OpenCL) prints:

oclConvolutionSeparable, Throughput = 1877.6873 MPixels/s, Time = 0.00503 s, Size = 9437184 Pixels, NumDevsUsed = 1, Workgroup = 0

Histogram (CUDA) prints:

histogram64, Throughput = 8468.7535 MB/s, Time = 0.00792 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 64

oclHistogram (OpenCL) prints:

oclHistogram64, Throughput = 20750.2322 MB/s, Time = 0.00323 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 64
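
As a quick sanity check on the figures above, the printed throughput can be recomputed as Size divided by Time. The sketch below is only an illustration: the function names are made up for this post, the numbers are copied from the output above, and it assumes "MPixels" and "MB" mean 10^6 pixels and 10^6 bytes. Because the Time field is rounded, the recomputed values will match the printed ones only approximately.

```python
# Recompute throughput from the Size and Time fields printed by the SDK samples.
# Function names are hypothetical; the figures are copied from the output above.

def throughput_millions_per_sec(size, seconds):
    """Return size / seconds in units of 10^6 items per second (assumed unit)."""
    return size / seconds / 1e6

def time_ratio(slow_seconds, fast_seconds):
    """Return how many times longer the slow run took than the fast run."""
    return slow_seconds / fast_seconds

runs = {
    "convolutionSeparable (CUDA)": (9437184, 0.03175),
    "oclConvolutionSeparable (OpenCL)": (9437184, 0.00503),
    "histogram64 (CUDA)": (67108864, 0.00792),
    "oclHistogram64 (OpenCL)": (67108864, 0.00323),
}

for name, (size, seconds) in runs.items():
    print(f"{name}: {throughput_millions_per_sec(size, seconds):.1f}")

print(f"convolution time ratio: {time_ratio(0.03175, 0.00503):.2f}")
print(f"histogram time ratio:   {time_ratio(0.00792, 0.00323):.2f}")
```

The recomputed values agree with the printed throughput, so the gap comes from the measured times themselves and not from a difference in how the two samples report their results.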


Here are some other interesting benchmarks on this topic.

These are my results on OpenSUSE 12.1 with GTX570 using Nvidia beta driver 302.07.

CUDA C: convolutionSeparable, Throughput = 3891.8228 MPixels/sec, Time = 0.00242 s, Size = 9437184 Pixels, NumDevsUsed = 1, Workgroup = 0
OpenCL: oclConvolutionSeparable, Throughput = 3655.7091 MPixels/s, Time = 0.00258 s, Size = 9437184 Pixels, NumDevsUsed = 1, Workgroup = 0

CUDA C: histogram256, Throughput = 19842.9523 MB/s, Time = 0.00338 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 192
OpenCL: oclHistogram256, Throughput = 20138.0799 MB/s, Time = 0.00333 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 192

The performance of CUDA and OpenCL is very close here. I don’t know why you see such large differences.
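
To put a number on "very close", here is a minimal sketch that computes the relative gap from the GTX570 throughput figures above. The function name is made up for illustration; a positive result means OpenCL is slower than CUDA by that percentage of the CUDA figure, and a negative result means OpenCL is faster.

```python
# Relative throughput gap between the CUDA and OpenCL runs on the GTX570.
# The function name is hypothetical; the figures are copied from the results above.

def relative_gap_percent(cuda_throughput, opencl_throughput):
    """Return (CUDA - OpenCL) / CUDA as a percentage."""
    return (cuda_throughput - opencl_throughput) / cuda_throughput * 100.0

convolution_gap = relative_gap_percent(3891.8228, 3655.7091)
histogram_gap = relative_gap_percent(19842.9523, 20138.0799)

print(f"convolutionSeparable gap: {convolution_gap:.2f}%")
print(f"histogram256 gap:         {histogram_gap:.2f}%")
```

With these inputs the convolution sample comes out at roughly 6% in favour of CUDA, and the histogram sample at roughly 1.5% in favour of OpenCL.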

Hi Melonakos, the link you posted was very helpful and I found a lot of useful information over there, thanks a lot!

ZHAO Peng, your timings suggest that some issue is causing my CUDA versions to run very slowly. I am currently investigating it. Thank you for that.

Is nVidia GPU performance the key point for choosing between CUDA and OpenCL?

I mean, if you are limited to nVidia GPUs for your project, and don’t plan to use the SSE or AVX units through CPU implementations of OpenCL, the Ivy Bridge GPU, or the new AMD Radeon GCN architecture (which is really fast on OpenCL), then maybe asking how much performance you will get from OpenCL versus CUDA is relevant.

CUDA has some advantages over OpenCL in performance and features on nVidia GPUs. It has none on any other platform, including real-world performance on complex algorithms!
I think that being maybe 5% or 10% slower (in the worst case) isn’t a problem, since OpenCL enables my client to choose among all OpenCL-enabled devices. That wasn’t mandatory last year, but now it is, since we can compare architectures and results… (I don’t want to promote one brand over another, especially in this forum.)

Is AMD’s OpenCL implementation poor? If it is good, then we should all jump ship to AMD, right? :confused:

According to some benchmark results, the performance of AMD’s OpenCL implementation is pretty good. More importantly, AMD is very active in supporting OpenCL. They have released the first OpenCL 1.2 implementation.

However, the quality of AMD’s graphics driver is so poor, especially on Linux, that I wouldn’t buy their products without a very good reason. Besides, AMD should present a clearer GPU computing strategy and build a community the way Nvidia has.
