OpenCL performances on NVIDIA GTX 260 and ATI Radeon HD

enzo30980 · December 6, 2012, 9:06am

Hi, I wrote an OpenCL kernel doing the dot product between two double arrays. This is the code:
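#pragma OPENCL EXTENSION cl_khr_fp64 : enable  // double needs this on OpenCL 1.0/1.1 devices

__kernel void evaluate_product(__global const double *pFirstArray, const int n,
                               __global const double *pSecondArray,
                               __global double *pOutArray)
{
    int gid = get_global_id(0);
    int size = get_global_size(0);
    if (gid >= 0 && gid < size) {
        double output = 0.0;
        // The posted code read pLocal[k], which is never declared;
        // pFirstArray is assumed to be the intended array.
        for (int k = 0; k < n; k++)
            output += pFirstArray[k] * pSecondArray[k];
        pOutArray[gid] = output;
    }
}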

Why did this kernel take 30 ms on the NVIDIA GTX 260, while on the ATI Radeon HD 6900 it took less than 10 ms?
Any ideas? Or is there some optimization I could use in the kernel for the NVIDIA card?
Tks

eyalhir74 · December 6, 2012, 11:45am

First of all the 6900 is supposed to be much faster than the 260. You should compare it
to the 570 or something newer than 260.

You also don't specify the kernel dimensions.

BTW - as a side note for nVidia - maybe it would be worthwhile opening a dedicated
OpenCL forum. Those posts on OpenCL really interfere with the CUDA ones.
Maybe call the forum: OBSOLETE_NOT_SO_OPEN_OPENCL_FORUM :)

enzo30980 · December 6, 2012, 12:38pm

What do you mean by “You also don't specify the kernel dimensions”? Does it have some effect on the efficiency of the computation?

eyalhir74 · December 6, 2012, 12:44pm

Of course. For example, a work group of 32 threads would be inefficient on AMD, because AMD prefers multiples of 64, whereas nVidia prefers multiples of 32.
Yet another small OpenCL goodie.
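
To put a rough number on that point, here is a minimal sketch in plain C. The helper name lane_utilization is made up for illustration and is not part of OpenCL. It assumes the hardware schedules work-items in fixed batches of 32 (an NVIDIA warp) or 64 (an AMD wavefront) and estimates what fraction of those lanes a given work-group size keeps busy. On a real device the preferred multiple can be queried with clGetKernelWorkGroupInfo and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE (OpenCL 1.1 and later).

```c
#include <stddef.h>

/* Fraction of hardware lanes doing useful work when a work-group of
 * `group_size` work-items runs on hardware that schedules `width` lanes
 * at a time (assumed: 32 for an NVIDIA warp, 64 for an AMD wavefront).
 * Returns a value in (0, 1], or 0.0 if either argument is 0. */
double lane_utilization(size_t group_size, size_t width)
{
    if (group_size == 0 || width == 0)
        return 0.0;
    size_t batches = (group_size + width - 1) / width; /* ceiling division */
    return (double)group_size / (double)(batches * width);
}
```

Under this simple model, a work group of 32 on 64-wide hardware leaves half of the lanes idle, while 64 fills them completely.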

enzo30980 · December 6, 2012, 2:01pm

Hi, tks for your reply.
I launch the kernel in this way:
cl_int status = clEnqueueNDRangeKernel (queue, kernel, 1, NULL, &buffSize, 0, 0, NULL, NULL);
Where should I set the number of threads for the work group?
Sorry, but I’m new to OpenCL programming :(
Tks

eyalhir74 · December 6, 2012, 7:34pm

The first zero after buffSize is the local_work_size argument, and that is what
sets it. Since you pass zero, I guess the runtime chooses a default one. For AMD
I think the default is OK; for nVidia maybe the default is wrong, causing bad performance.
Take a look here for more details:
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clEnqueueNDRangeKernel.html

eyal

enzo30980 · December 7, 2012, 9:15am

Tks eyal, :) you are really kind.
I read the documentation of clEnqueueNDRangeKernel and it says that “local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to be break the global work-items into appropriate work-group instances.”
Suppose I want to set the local work size explicitly. My problem is 1-dimensional with a global work size of 329024. The device limits are:
NVIDIA: CL_DEVICE_MAX_WORK_GROUP_SIZE = 512, CL_DEVICE_MAX_WORK_ITEM_SIZES = 512 / 512 / 64
AMD: CL_DEVICE_MAX_WORK_GROUP_SIZE = 256, CL_DEVICE_MAX_WORK_ITEM_SIZES = 256 / 256 / 256
How should I set my local work size?
As you said before, should I set it to a multiple of 32 for NVIDIA and a multiple of 64 for AMD, staying within 512 for NVIDIA and 256 for AMD, so that the work groups cover the whole global work size? Or is it enough to set the local work size to int(329024/512) and int(329024/256)?
Tks again
Lorenzo

eyalhir74 · December 7, 2012, 7:41pm

I think int(329024/512) and int(329024/256) is a good start.
However, this is really more a matter of trial and error: try a few scenarios, see
whether they have any effect on the performance, and choose the one that performs best.

eyal
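
When trying candidates, note that OpenCL 1.x only accepts a local work size that is no larger than CL_DEVICE_MAX_WORK_GROUP_SIZE and that divides the global work size exactly; otherwise clEnqueueNDRangeKernel returns CL_INVALID_WORK_GROUP_SIZE. The sketch below is plain C with hypothetical helper names (is_valid_local_size and best_local_size are not OpenCL functions). It checks a candidate against those two rules and searches for the largest valid size that is a multiple of 32 or 64.

```c
#include <stddef.h>

/* Returns 1 if `local` is an acceptable 1-D work-group size under the
 * OpenCL 1.x rules: non-zero, no larger than the device limit, and an
 * exact divisor of the global size. Returns 0 otherwise. */
int is_valid_local_size(size_t global, size_t local, size_t max_group)
{
    if (local == 0 || local > max_group)
        return 0;
    return global % local == 0;
}

/* Largest valid local size that is a multiple of `multiple`
 * (for example 32 or 64), or 0 if no such size exists. */
size_t best_local_size(size_t global, size_t max_group, size_t multiple)
{
    if (multiple == 0)
        return 0;
    size_t candidate = (max_group / multiple) * multiple;
    for (; candidate >= multiple; candidate -= multiple)
        if (is_valid_local_size(global, candidate, max_group))
            return candidate;
    return 0;
}
```

With the numbers from this thread, int(329024/512) = 642 and int(329024/256) = 1285 both exceed the respective device limits, so they would be rejected. For a global size of 329024 the search returns 64 on both devices; padding the global size to a multiple of a larger work group is the usual way around that.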

enzo30980 · December 10, 2012, 7:14pm

Hi, I tried setting my local_work_size to both 512 and 256, and my global_work_size to the smallest multiple of 512 and of 256 greater than 329024 (of course I passed the real size to my kernel and checked that the global id was less than it). No effect on the performance. :(
How can I improve it?
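
For reference, the padding described above can be sketched like this in plain C. The helper name round_up_global is an assumption for illustration; the variable names in the comment (queue, kernel, realSize) follow the earlier posts or are made up.

```c
#include <stddef.h>

/* Smallest multiple of `local` that is >= `n` (requires local > 0). */
size_t round_up_global(size_t n, size_t local)
{
    return ((n + local - 1) / local) * local;
}

/* Hypothetical host-side use, assuming `queue` and `kernel` already exist:
 *
 *   size_t realSize = 329024;
 *   size_t local    = 256;
 *   size_t global   = round_up_global(realSize, local);
 *   clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
 *                          0, NULL, NULL);
 *
 * The kernel then needs the real size as an argument and must skip
 * work-items whose get_global_id(0) is >= realSize.
 */
```

For 329024 elements, both a local size of 512 and a local size of 256 give a padded global size of 329216.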

vcosta · December 29, 2012, 11:24pm

Double precision floating point is going to be slow. Do you really need double precision?
Perhaps using double4 or something similar would be quicker.

Check how many theoretical peak DP FLOPS your card can do.
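
As a rough guide to that check, theoretical peak throughput is usually estimated as the number of execution units times the floating-point operations each unit can issue per cycle times the clock rate. The sketch below is plain C with a made-up helper name (peak_gflops). The inputs have to come from the vendor's specifications for the specific card; none are assumed here.

```c
/* Theoretical peak throughput in GFLOP/s:
 * units * flops issued per unit per cycle * clock in GHz.
 * All three inputs must be looked up for the specific card. */
double peak_gflops(unsigned units, unsigned flops_per_cycle, double clock_ghz)
{
    return (double)units * (double)flops_per_cycle * clock_ghz;
}
```

For example, a hypothetical device with 24 double-precision units, 2 operations per cycle (one fused multiply-add) and a 1.25 GHz clock would peak at 60 GFLOP/s in double precision. Comparing that figure for the two cards gives an upper bound on how much of the timing gap arithmetic alone could explain.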
