Questions regarding the OpenCL compute units

My system: openSUSE Linux 11.0

NV-Software:

cudatoolkit_3.1_linux_64_suse11.2.run

devdriver_3.1_linux_64_258.19_opencl1.1.run

gpucomputingsdk_1_1_beta_linux.run

Hello!

I have some questions about the compute units of the different NV GPUs.

If I print out CL_DEVICE_MAX_COMPUTE_UNITS for a GTX260 (216 cores), I get 27.

I’ve read that it has 28 ROPs and 9 shader clusters, so I thought:

OK, the ROPs are seen as the “OpenCL compute units” and one of the ROPs isn’t used (for whatever reason).

But the CL_DEVICE_MAX_COMPUTE_UNITS value for my GTX470 is only 14, although it has 40 ROPs.

Moreover it has 14 shader clusters…

So can anyone explain to me how the number of compute units on an NV device comes about?

My next question is:

Is it possible to determine the number of concurrently running work-groups (like you can for work-items with get_local_size(X))?

And is it possible to get the ID of the current work-group within the set of work-groups that are running concurrently at that moment?

My problem is that I want to have one buffer in global memory for each running work-group.

Each buffer has to be bigger than the maximum local memory size of the device, so I can’t use local memory for these buffers.

So I thought the number of buffers would have to be equal to CL_DEVICE_MAX_COMPUTE_UNITS, because no more than CL_DEVICE_MAX_COMPUTE_UNITS work-groups can be executing at any one moment. Or can there be more?
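One way to avoid depending on how many work-groups run at once is to give every launched work-group its own slice of a single global buffer. The following is only a minimal sketch of that indexing, written as plain C so it can be checked on the host. The function names and the slice size of 8192 elements are hypothetical. In a kernel, the group index would come from get_group_id(0) and the group count from get_num_groups(0).

```c
#include <stddef.h>

/* Total number of elements in a scratch buffer that holds one slice
   per launched work-group (not per concurrently running group). */
static size_t scratch_total_elems(size_t num_groups, size_t elems_per_group)
{
    return num_groups * elems_per_group;
}

/* Index of the first element of the slice owned by one work-group.
   In a kernel, group_id would be get_group_id(0). */
static size_t scratch_slice_start(size_t group_id, size_t elems_per_group)
{
    return group_id * elems_per_group;
}
```

With 100 work-groups and a hypothetical 8192 elements each, the buffer holds 819200 elements and the last group's slice starts at element 811008. The cost is memory proportional to the number of launched groups, which may be too much for large launches.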

I wrote this small kernel to determine the maximum number of concurrently running groups:

const char* OpenCLSource[] = {
"\
__kernel void Test(volatile __global int* count,\n\
                   volatile __global int* a)\n\
{\n\
  volatile int i;\n\
  barrier(CLK_GLOBAL_MEM_FENCE);\n\
  i = atomic_inc(count);\n\
  barrier(CLK_GLOBAL_MEM_FENCE);\n\
  \n\
  a[get_global_id(0)] = i;\n\
  \n\
  barrier(CLK_GLOBAL_MEM_FENCE);\n\
  i = atomic_inc(count);\n\
  barrier(CLK_GLOBAL_MEM_FENCE);\n\
  \n\
}\n"
};

But after running this kernel with a work_dim of 1, a global_work_size of {100} and a local_work_size of {1}, the values in the array ‘a’ are much bigger than CL_DEVICE_MAX_COMPUTE_UNITS.

Why can this happen, although each group must decrement the count value before it can finish, and hence before the next group can be executed on the same compute unit?

I know these are very long questions, but I know nobody who can explain this to me, and I don’t know where the error in my reasoning is…

I hope one of you can help me!

Best regards,

Lukas

CL_DEVICE_MAX_COMPUTE_UNITS refers to the number of streaming multiprocessors (SMs) on the device.
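The reported values can be reproduced with a little arithmetic: the number of SMs is the core count divided by the cores per SM. The sketch below assumes 8 cores per SM for the GTX260 generation and 32 cores per SM (448 cores in total) for the GTX470. These figures are not given in this thread, and the function name is hypothetical.

```c
#include <assert.h>

/* Compute units (SMs) = total cores / cores per SM.
   Assumed values: 8 cores per SM on the GTX260,
   32 cores per SM on the GTX470. */
static unsigned compute_units(unsigned total_cores, unsigned cores_per_sm)
{
    assert(cores_per_sm != 0); /* guard against division by zero */
    return total_cores / cores_per_sm;
}
```

Under these assumptions, compute_units(216, 8) gives 27 for the GTX260 and compute_units(448, 32) gives 14 for the GTX470, matching the values printed above.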

Thanks!

I know what the OpenCL spec says, but I don’t know how the value of CL_DEVICE_MAX_COMPUTE_UNITS relates to the internal structure of the mentioned NVIDIA GPUs…

I’ve also printed out the value for an ATI HD5850 - it was 18 and this GPU has 18 shader clusters. So I think the CL_DEVICE_MAX_COMPUTE_UNITS for the current GPUs is the number of shader clusters. God knows why it was 27 for a GTX260…

Best regards!