OpenCL global work size

Valmass · December 19, 2010, 1:46pm

Hi,

I have a problem with Compute Visual Profiler:
I get a strange result when I run this kernel.
My global work size is [318,319,1]:
Error =clEnqueueNDRangeKernel(Queue,Kernel,3,NULL,global,NULL,0,NULL,&device_exec);
I pass NULL as the local work size, so the OpenCL implementation chooses the work-group size.
My kernel code is:

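__kernel void Confronto(__global unsigned int *out)
{
    int gidx, gidy, index;
    gidx = get_global_id(0);
    gidy = get_global_id(1);
    /* Linear index of this work-item in the 2D range. */
    index = gidy * get_global_size(0) + gidx;
    /* Only the first work-item records the global sizes. */
    if (index == 0) {
        out[0] = get_global_size(0);
        out[1] = get_global_size(1);
        out[2] = get_global_size(2);
    }
}
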
Printed result:

out[0] → 318
out[1] → 319
out[2] → 1

But when I use the profiler, the NDRange size column shows something different:

Profiler Result:
NDRange size : [1 319]
Work group size [318 1 1]

Could you please help me understand what happens?
Thanks in advance.

Spoo · December 22, 2010, 11:37pm

Firstly, if your last dimension is just “1”, I would make the global work size 2D. Second, I don’t know how good NVIDIA’s OpenCL implementation is at picking local work sizes, but most implementations I’ve used aren’t good at it, especially when the work dimensions are not a multiple of the number of threads per warp (32 on NVIDIA GPUs such as the T10; AMD wavefronts are 64 wide). As a rule of thumb (and just that), make the local work size a multiple of 64 threads. That said, each global work size must be a multiple of the corresponding local work size. What I do is round the global work size up to the nearest multiple of the local size. You may then have to add an if statement to your kernel to check that the work-item’s global ID is less than (318, 319), but this is the cost of doing business on a GPU.
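
To make the rounding-up idea concrete, here is a minimal sketch in C. The helper name round_up_global, the 16x16 local size mentioned below, and the kernel GuardedExample with its width and height arguments are all assumptions made for illustration; none of them come from the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round `global` up to the nearest multiple of `local`.
 * `local` must be greater than zero. */
size_t round_up_global(size_t global, size_t local)
{
    assert(local > 0);
    size_t remainder = global % local;
    return remainder == 0 ? global : global + (local - remainder);
}

/* Hypothetical kernel source showing the bounds guard. The extra work-items
 * created by the padding return early instead of touching memory. */
static const char *guarded_kernel_src =
    "__kernel void GuardedExample(__global uint *out, uint width, uint height)\n"
    "{\n"
    "    uint gidx = get_global_id(0);\n"
    "    uint gidy = get_global_id(1);\n"
    "    if (gidx >= width || gidy >= height)\n"
    "        return; /* padding work-item, nothing to do */\n"
    "    out[gidy * width + gidx] = gidy * width + gidx;\n"
    "}\n";
```

With an assumed local size of {16, 16} (256 work-items, a multiple of 32), a 318 x 319 problem would be launched with a global size of {320, 320}, passing both arrays to clEnqueueNDRangeKernel with a work dimension of 2 and the real 318 and 319 as the kernel's width and height.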

Sorry for not actually answering your question, but I’ve never really used the profiler. However, it appears your kernel is reporting the correct dimensions.

ljbadenz · December 23, 2010, 6:28pm

The automatic local work size algorithm NVIDIA uses is shocking. With NVIDIA OpenCL, always choose the local work size yourself. Read the OpenCL Programming Guide and the Best Practices Guide to work out how to choose it correctly.

My code got a 4x speed up when I tuned the local work group size over the default.
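
One way to approach this kind of tuning is to generate a handful of candidate local sizes and time the kernel with each. The sketch below only builds the candidate list; the struct local_size_2d and the function candidate_local_sizes are hypothetical names, and the restriction to power-of-two shapes is an assumption, not a rule from the guides. In practice max_group would come from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: one 2D local work size to benchmark. */
typedef struct {
    size_t x;
    size_t y;
} local_size_2d;

/* Fills `out` with power-of-two (x, y) pairs whose product is a multiple of
 * `warp` and no larger than `max_group`. Writes at most `capacity` entries
 * and returns how many were written. */
size_t candidate_local_sizes(size_t warp, size_t max_group,
                             local_size_2d *out, size_t capacity)
{
    assert(warp > 0);
    size_t count = 0;
    for (size_t x = 1; x <= max_group; x *= 2) {
        for (size_t y = 1; x * y <= max_group; y *= 2) {
            if ((x * y) % warp != 0)
                continue; /* would leave a partially filled warp */
            if (count == capacity)
                return count; /* output buffer is full */
            out[count].x = x;
            out[count].y = y;
            count++;
        }
    }
    return count;
}
```

Each candidate would then be passed as the local work size to clEnqueueNDRangeKernel, with the global size rounded up to match, and the fastest one kept.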