OpenCL Performance benchmarking and comparative analysis

Wendell · May 14, 2009, 9:32am

Hi.

I just tested the nBody sample application from the NVIDIA OpenCL API. I haven’t explored the source code, but does anybody know why its performance is so much worse than the CUDA version’s?

Thanks.

theMarix · May 14, 2009, 1:50pm

I don’t know, but I noticed that the bandwidth test reports much lower results than the CUDA version, which might account for some of the performance loss. After all, this is still alpha software. I guess future versions will perform similarly to the CUDA implementation.

maolimu · May 19, 2009, 5:29pm

I haven’t tried the demo projects yet (I’m on a computer without an NVIDIA GPU right now).

But according to the release notes, the OpenCL-OpenGL integration is not working at this time, so I guess nBody is copying its results to OpenGL through main memory.

That could be the cause of the poor performance.

Can someone confirm this?

Mark

initram · May 29, 2009, 2:37pm

Here’s my experience with the bandwidth on my (oooold) GeForce 8600GS:

The only way to reach the full bandwidth is to use the following kernel:

[codebox]// Squares every float of a 2D buffer in place; each work-item handles four floats.
// buffer is declared as a pointer to rows of GLOBAL_WORK_SIZE_D0 floats so that
// buffer[row][column] indexing is valid. Dimension 0 is assumed to be launched
// with GLOBAL_WORK_SIZE_D0/4 work-items.
__kernel void kernel_mult_blocks_float4s(__global float (*buffer)[GLOBAL_WORK_SIZE_D0])
{
    const unsigned int gid0 = get_global_id(0);
    const unsigned int gid1 = get_global_id(1);
#if 0
    // SLOOOOOOW: each work-item squares four consecutive floats,
    // so adjacent work-items access addresses four floats apart.
    buffer[gid1][gid0*4+0] = buffer[gid1][gid0*4+0]*buffer[gid1][gid0*4+0];
    buffer[gid1][gid0*4+1] = buffer[gid1][gid0*4+1]*buffer[gid1][gid0*4+1];
    buffer[gid1][gid0*4+2] = buffer[gid1][gid0*4+2]*buffer[gid1][gid0*4+2];
    buffer[gid1][gid0*4+3] = buffer[gid1][gid0*4+3]*buffer[gid1][gid0*4+3];
#else
    // FAST VERSION: adjacent work-items access adjacent floats;
    // the four floats of one work-item lie a quarter of a row apart.
    buffer[gid1][gid0+0*(GLOBAL_WORK_SIZE_D0/4)] = buffer[gid1][gid0+0*(GLOBAL_WORK_SIZE_D0/4)]*buffer[gid1][gid0+0*(GLOBAL_WORK_SIZE_D0/4)];
    buffer[gid1][gid0+1*(GLOBAL_WORK_SIZE_D0/4)] = buffer[gid1][gid0+1*(GLOBAL_WORK_SIZE_D0/4)]*buffer[gid1][gid0+1*(GLOBAL_WORK_SIZE_D0/4)];
    buffer[gid1][gid0+2*(GLOBAL_WORK_SIZE_D0/4)] = buffer[gid1][gid0+2*(GLOBAL_WORK_SIZE_D0/4)]*buffer[gid1][gid0+2*(GLOBAL_WORK_SIZE_D0/4)];
    buffer[gid1][gid0+3*(GLOBAL_WORK_SIZE_D0/4)] = buffer[gid1][gid0+3*(GLOBAL_WORK_SIZE_D0/4)]*buffer[gid1][gid0+3*(GLOBAL_WORK_SIZE_D0/4)];
#endif
}[/codebox]

The following kernel reaches only half of that:

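[codebox]// Squares every float of the 2D buffer in place; one float per work-item.
// buffer is declared as a pointer to rows of GLOBAL_WORK_SIZE_D0 floats so that
// buffer[row][column] indexing is valid.
__kernel void kernel_mult_blocks(__global float (*buffer)[GLOBAL_WORK_SIZE_D0])
{
    const unsigned int gid0 = get_global_id(0);
    const unsigned int gid1 = get_global_id(1);

    buffer[gid1][gid0] = buffer[gid1][gid0]*buffer[gid1][gid0];
}[/codebox]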

The best bandwidth was achieved using 128 or 256 work-items per work-group.

Does anyone know why the bandwidth is so low in the second case? Is it the old graphics card, or a wrong kernel?

jcornwall · May 29, 2009, 3:01pm

OpenCL’s memory performance on NVIDIA is all over the place at the moment.

My Mac Pro (running Linux) gets about 7GB/s device<->device with oclBandwidthTest on a GT 120.

My PC workstation (running Linux) gets about 27GB/s on a GTX 260. Under CUDA’s bandwidthTest I get 93GB/s.

initram · June 9, 2009, 11:00pm

I also noticed the performance gap when comparing CUDA and OpenCL programs: the OpenCL programs run very slowly.

@NVIDIA: when do you expect OpenCL programs to reach the same performance as CUDA programs?
