Same Implementation in CUDA and OpenCL but Different Performance, and OpenCL Faster?

Hi, I have a kernel implemented in both CUDA and OpenCL.
I have made sure the kernel code is the same at the application level: the same grid and thread-block configuration and the same algorithm.
However, I observe that the kernel execution times are quite different: the OpenCL version takes half the time of the CUDA version.
My environment is the CUDA 4.1 toolkit and gcc 4.4, with NVIDIA driver 304.88, on Kubuntu 12.10.
The GPU is a GT430, compute capability 2.1.

I profiled both kernels with the Compute Visual Profiler and found that the global memory throughput differs a lot, but I am stuck, since the applications are the same.

So I played a little with the nvcc compiler options. My first try was
nvcc -keep -I/someincludepaths -gencode=arch=compute_20,code=sm_20
and then
nvcc -keep -I/someincludepaths -gencode=arch=compute_20,code=sm_21
but I see no performance change from varying the "code" option.
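One way to check whether the two "code" options actually produce different machine code is to dump the SASS of each build and compare. A sketch (file names and include paths are assumptions):

```shell
# GT430 is compute capability 2.1, so target sm_21 directly
nvcc -keep -I/someincludepaths -gencode=arch=compute_20,code=sm_21 -c kernel.cu
# Inspect the generated machine code; repeat with code=sm_20 and diff the output
cuobjdump -sass kernel.o
```

If the SASS is identical for both builds, the flag change could not have affected performance, which would explain seeing no difference.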

Given the common assumption that CUDA is faster than OpenCL, my result is contradictory.
My profiling strategy is:

for OpenCL: bind an event to the kernel call, query its start and end times, and take the difference.
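Concretely, the OpenCL timing I describe looks roughly like this (a minimal sketch; it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, and queue, kernel, global_size, and local_size are placeholders):

```c
cl_event evt;
cl_ulong t_start, t_end;

/* Launch the kernel with an event bound to it */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, &evt);
clWaitForEvents(1, &evt);   /* make sure the kernel has finished */

/* Query device-side start/end timestamps (in nanoseconds) */
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(t_end), &t_end, NULL);

printf("kernel time: %.3f ms\n", (t_end - t_start) * 1e-6);
clReleaseEvent(evt);
```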

for CUDA: also using events. Below is a piece of my profiling code:
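(The original snippet is not shown; a minimal sketch of the standard cudaEvent_t timing pattern, where myKernel, grid, block, and args are placeholders:)

```cuda
cudaEvent_t start, stop;
float ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(args);      // placeholder kernel launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);           // wait until the kernel has finished

cudaEventElapsedTime(&ms, start, stop);  // result in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Both timing methods measure device-side execution time, so they should be comparable between the two APIs.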


Can anyone give some clue as to why this happens, or a strategy to trace what is going on?

No replies yet, sad.

Can you provide the code for a simple case that shows the difference? There are some differences between CUDA and OpenCL that require different optimisation tricks to maximise throughput, but without a simple repro case (showing the full host code and both the OpenCL and CUDA kernel source), people will find it very difficult to give any detailed answers.