OpenCL performance on NVIDIA GTX 260 and ATI Radeon HD

Hi, I wrote an OpenCL kernel that computes the dot product of two double arrays. This is the code:
__kernel void evaluate_product(__global const double *pFirstArray, const int n,
                               __global const double *pSecondArray,
                               __global double *pOutArray)
{
    int gid = get_global_id(0);
    int size = get_global_size(0);
    if (gid >= 0 && gid < size) {
        double output = 0.0;
        for (int k = 0; k < n; k++)
            /* the post read pLocal[k], which is not declared; pFirstArray is assumed */
            output += pFirstArray[k] * pSecondArray[k];
        pOutArray[gid] = output;
    }
}

Why does this kernel take 30 ms on an NVIDIA GTX 260, while on an ATI Radeon HD 6900 it takes less than 10 ms?
Any ideas? Or is there some optimization I could use in the kernel for the NVIDIA card?
Tks

First of all, the 6900 is supposed to be much faster than the 260. You should compare it
to the 570 or something newer than the 260.

You also don't specify the kernel dimensions.

BTW - as a side note for nVidia - maybe it would be worthwhile opening a dedicated
OpenCL forum. Those posts on OpenCL really interfere with the CUDA ones.
Maybe call the forum: OBSOLETE_NOT_SO_OPEN_OPENCL_FORUM :)

What do you mean by “You also don't specify the kernel dimensions”? Does it have some effect on the efficiency of the computation?

Of course. For example, if you open a work group of 32 threads on AMD, that would be inefficient, as AMD prefers multiples of 64, as opposed to nVidia, which prefers multiples of 32.
Yet another small OpenCL goodie.
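
To make the "multiples of 32 or 64" point concrete, here is a minimal sketch in plain C. The helper name pick_local_size is made up for illustration and is not part of OpenCL. It rounds a requested work-group size down to the hardware multiple and keeps it within the device limit. On OpenCL 1.1 the multiple itself can be queried per kernel with clGetKernelWorkGroupInfo and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, so the 32 and 64 used here are only the values mentioned in this thread.

```c
#include <stddef.h>

/* Hypothetical helper (not an OpenCL call): round `requested` down to a
 * multiple of `multiple` without exceeding `max_wg`, where `max_wg` stands
 * for the device's CL_DEVICE_MAX_WORK_GROUP_SIZE. */
size_t pick_local_size(size_t requested, size_t multiple, size_t max_wg)
{
    if (multiple == 0 || max_wg == 0)
        return 0;                         /* no sensible answer */
    if (requested > max_wg)
        requested = max_wg;               /* respect the device limit */

    size_t rounded = (requested / multiple) * multiple;
    if (rounded == 0)                     /* requested was below one multiple */
        rounded = (multiple <= max_wg) ? multiple : max_wg;
    return rounded;
}
```

For example, a request of 100 threads with a multiple of 64 and a limit of 256 would come back as 64, and any request above the limit is clamped to the limit first.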

Hi, tks for your reply.
I call kernel execution in this way:
cl_int status = clEnqueueNDRangeKernel (queue, kernel, 1, NULL, &buffSize, 0, 0, NULL, NULL);
Where should I set the number of threads for the work group?
Sorry but I’m new to OpenCL programming :(
Tks

The first zero after &buffSize is the local_work_size argument, which sets it. Since
you pass zero there, I guess the runtime chooses a default one. For AMD I think the
default is ok; for nVidia maybe the default is wrong, causing bad performance.
Take a look here for more details:
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clEnqueueNDRangeKernel.html

eyal

Tks eyal, :) you are really kind.
I read the documentation of clEnqueueNDRangeKernel and it says that “local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to be break the global work-items into appropriate work-group instances.” So suppose I want to set the local work size explicitly. My problem is 1-dimensional with a global work size of 329024. My NVIDIA card reports CL_DEVICE_MAX_WORK_GROUP_SIZE = 512 and CL_DEVICE_MAX_WORK_ITEM_SIZES = 512 / 512 / 64, while the AMD card reports CL_DEVICE_MAX_WORK_GROUP_SIZE = 256 and CL_DEVICE_MAX_WORK_ITEM_SIZES = 256 / 256 / 256.
How should I set my local work size?
As you said before, should I choose a local work size that is a multiple of 32 for NVIDIA and a multiple of 64 for AMD, stays within 512 for NVIDIA and 256 for AMD, and covers the whole global work size? Or is it enough to set the local work size to int(329024/512) and int(329024/256)?
Tks again
Lorenzo

I think int(329024/512) and int(329024/256) is a good start.
However, this is really more a matter of trial and error: try a few scenarios, see
if they have any effect on the performance, and choose the one that yields the best.

eyal

Hi, I tried setting my local_work_size to both 512 and 256, and my global_work_size to the smallest multiple of 512 or 256 greater than 329024 (of course I pass the real size to my kernel and check that the global id is less than it). No effect on the performance. :(
How can I improve it?
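
For anyone following along, the padding described above can be sketched in plain C as below. The function names are made up for illustration; only the arithmetic matters. In OpenCL 1.x the global work size has to be evenly divisible by an explicit local work size, which is why 329024 gets rounded up.

```c
#include <stddef.h>

/* Smallest multiple of `local` that is >= `n` (hypothetical helper). */
size_t padded_global_size(size_t n, size_t local)
{
    if (local == 0)
        return n;                         /* nothing to pad against */
    return ((n + local - 1) / local) * local;
}

/* Number of work-groups that the padded launch produces. */
size_t num_work_groups(size_t n, size_t local)
{
    if (local == 0)
        return 0;
    return padded_global_size(n, local) / local;
}

/* Host-side use, shown as a comment because it needs the OpenCL headers:
 *
 *   size_t local  = 256;
 *   size_t global = padded_global_size(329024, local);
 *   clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
 *                          0, NULL, NULL);
 *
 * The kernel then needs the real element count as an extra argument and
 * a guard such as `if (gid < real_size)` so the padding items do nothing. */
```

With 329024 elements, both 512 and 256 pad up to 329216 work-items (643 and 1286 work-groups respectively), so only 192 work-items are idle in either case.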

Double precision floating point is going to be slow. Do you really need double precision?
Perhaps using double4 or something similar would be quicker.

Check how many theoretical peak DP FLOPS your card can do.
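
A rough way to do that check, as a sketch in plain C. The theoretical peak is usually estimated as (number of double-precision units) x (clock) x (FLOPs per unit per cycle). All three inputs are assumptions you have to look up for your own card; nothing here is measured or specific to the GTX 260 or the 6900. The kernel cost assumes the kernel as posted, where every work-item runs the full loop of n multiply-adds.

```c
/* Theoretical peak in GFLOP/s from values taken from the vendor's specs.
 * flops_per_unit_per_cycle is 2.0 if a unit can issue one fused or
 * chained multiply-add per cycle (an assumption to verify per device). */
double peak_gflops(double dp_units, double clock_ghz,
                   double flops_per_unit_per_cycle)
{
    return dp_units * clock_ghz * flops_per_unit_per_cycle;
}

/* FLOPs done by the kernel as posted: each work-item performs n
 * multiplies and n adds. */
double kernel_flops(double work_items, double n)
{
    return 2.0 * work_items * n;
}

/* Lower bound on runtime in milliseconds if the kernel were purely
 * compute bound. Returns -1.0 for a non-positive peak. */
double min_runtime_ms(double flops, double gflops)
{
    if (gflops <= 0.0)
        return -1.0;
    return flops / (gflops * 1e9) * 1e3;
}
```

If the measured time is far above min_runtime_ms for your card, the kernel is probably limited by something other than raw double-precision throughput, memory access for instance.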