OpenCL Performance benchmarking and comparative analysis


I just tested the nBody sample application from the NVIDIA OpenCL release. I haven't explored the source code, but does anybody know why the performance is worse than the CUDA version?


I don’t know, but I noticed that the bandwidth measurements come out far lower than with the CUDA version, which might account for some of the performance loss. After all, this is still alpha software. I guess future versions will perform similarly to the CUDA implementation.

I haven’t tried the demo projects yet (I'm on a computer without an NVIDIA GPU right now).

But according to the release notes, the OpenCL - OpenGL integration is not working at this time, so I guess NBODY is copying results to OpenGL through main memory.

That could be the cause for the bad performance.

Can someone confirm this?


Here’s my experience with the bandwidth on my (oooold) GeForce 8600GS:

The only way to reach the full bandwidth is to use the following kernel:

[codebox]/* buffer is declared as rows of GLOBAL_WORK_SIZE_D0 floats so that the 2D indexing below is valid */
__kernel void kernel_mult_blocks_float4s(__global float buffer[][GLOBAL_WORK_SIZE_D0])
{
    const unsigned int gid0 = get_global_id(0);
    const unsigned int gid1 = get_global_id(1);

#if 0
    /* variant A (disabled): each work-item squares 4 adjacent floats */
    buffer[gid1][gid0*4+0] = buffer[gid1][gid0*4+0]*buffer[gid1][gid0*4+0];
    buffer[gid1][gid0*4+1] = buffer[gid1][gid0*4+1]*buffer[gid1][gid0*4+1];
    buffer[gid1][gid0*4+2] = buffer[gid1][gid0*4+2]*buffer[gid1][gid0*4+2];
    buffer[gid1][gid0*4+3] = buffer[gid1][gid0*4+3]*buffer[gid1][gid0*4+3];
#else
    /* variant B: each work-item squares 4 floats spaced GLOBAL_WORK_SIZE_D0/4 apart,
       so neighbouring work-items touch neighbouring floats */
    buffer[gid1][gid0+0*(GLOBAL_WORK_SIZE_D0/4)] = buffer[gid1][gid0+0*(GLOBAL_WORK_SIZE_D0/4)]*buffer[gid1][gid0+0*(GLOBAL_WORK_SIZE_D0/4)];
    buffer[gid1][gid0+1*(GLOBAL_WORK_SIZE_D0/4)] = buffer[gid1][gid0+1*(GLOBAL_WORK_SIZE_D0/4)]*buffer[gid1][gid0+1*(GLOBAL_WORK_SIZE_D0/4)];
    buffer[gid1][gid0+2*(GLOBAL_WORK_SIZE_D0/4)] = buffer[gid1][gid0+2*(GLOBAL_WORK_SIZE_D0/4)]*buffer[gid1][gid0+2*(GLOBAL_WORK_SIZE_D0/4)];
    buffer[gid1][gid0+3*(GLOBAL_WORK_SIZE_D0/4)] = buffer[gid1][gid0+3*(GLOBAL_WORK_SIZE_D0/4)]*buffer[gid1][gid0+3*(GLOBAL_WORK_SIZE_D0/4)];
#endif
}[/codebox]


the following kernel achieves only half of it:

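[codebox]/* buffer is declared as rows of GLOBAL_WORK_SIZE_D0 floats so that the 2D indexing below is valid */
__kernel void kernel_mult_blocks(__global float buffer[][GLOBAL_WORK_SIZE_D0])
{
    const unsigned int gid0 = get_global_id(0);
    const unsigned int gid1 = get_global_id(1);

    /* each work-item squares a single float */
    buffer[gid1][gid0] = buffer[gid1][gid0]*buffer[gid1][gid0];
}[/codebox]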


The best bandwidth was achieved using 128 or 256 work-items per work-group.

Does anyone know why the bandwidth is so low in the second case? Old graphics card? Wrong kernel?

OpenCL’s memory performance on NVIDIA is all over the place at the moment.

My Mac Pro (running Linux) gets about 7GB/s device<->device with oclBandwidthTest on a GT 120.

My PC workstation (running Linux) gets about 27GB/s on a GTX 260. Under CUDA’s bandwidthTest I get 93GB/s.

I also noticed performance issues when comparing CUDA programs with OpenCL programs: the OpenCL programs run much slower.

@nvidia: when do you expect to reach the same performance for opencl programs as for cuda programs?