My system: openSUSE Linux 11.0
NV-Software:
cudatoolkit_3.1_linux_64_suse11.2.run
devdriver_3.1_linux_64_258.19_opencl1.1.run
gpucomputingsdk_1_1_beta_linux.run
Hello!
I have some questions about the compute units of the different NVIDIA GPUs.
If I print out CL_DEVICE_MAX_COMPUTE_UNITS for a GTX260 (216 cores), I get 27.
I’ve read that it has 28 ROPs and 9 shader clusters, so I thought:
OK, the ROPs are seen as the “OpenCL compute units” and one of the ROPs isn’t used (for whatever reason).
But the CL_DEVICE_MAX_COMPUTE_UNITS value for my GTX470 is only 14, although it has 40 ROPs.
Moreover, it has 14 shader clusters…
So can anyone explain to me how the number of compute units on an NVIDIA device is determined?
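A quick sanity check on these numbers, sketched in C: dividing the core count by the reported compute units gives a whole number for both cards, which suggests the value follows groups of shader cores rather than ROPs. The 448-core figure for the GTX470 is an assumption taken from the card's published specification and is not stated above; `cores_per_compute_unit` and `print_ratios` are hypothetical helper names.

```c
#include <stdio.h>

/* Hypothetical helper: number of cores per reported compute unit,
   or -1 if the cores do not divide evenly among the compute units. */
int cores_per_compute_unit(int cores, int compute_units)
{
    if (compute_units <= 0 || cores % compute_units != 0)
        return -1;
    return cores / compute_units;
}

/* Prints the ratio for the two cards discussed above. */
void print_ratios(void)
{
    printf("GTX260: %d cores per compute unit\n",
           cores_per_compute_unit(216, 27));
    /* 448 cores is an assumption (published spec), not from the post. */
    printf("GTX470: %d cores per compute unit\n",
           cores_per_compute_unit(448, 14));
}
```

Trying the same division with the ROP counts (28 or 40) does not come out even for the GTX260, which is one reason to doubt the ROP explanation.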
My next question is:
Is it possible to determine the number of work-groups that are running concurrently (the way get_local_size(X) gives the number of work-items in a group)?
And is it possible to get the ID of the current work-group within the set of work-groups that are running concurrently at that moment?
My problem is that I want to have one buffer in global memory for each running work-group.
Each buffer has to be bigger than the maximum local memory size of the device, so I can’t use local memory for these buffers.
So I thought the number of buffers would have to be equal to CL_DEVICE_MAX_COMPUTE_UNITS, because no more than CL_DEVICE_MAX_COMPUTE_UNITS work-groups can be executing at any one moment. Or can there be more?
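One way to decouple the number of buffers from the number of work-groups, sketched here as a host-side C11 model and not as a kernel: keep a small pool of buffer slots, let each group claim a free slot with a compare-and-swap before it starts, and release the slot when it finishes. `SLOT_COUNT`, `claim_slot`, `release_slot` and `slot_offset` are hypothetical names, and the pool size of 4 is arbitrary. In a kernel the compare-and-swap would presumably be `atomic_cmpxchg` on a `__global int` array. Whether a work-group can safely spin while waiting for a free slot depends on how the device schedules groups, so this is an idea to test, not a guaranteed solution.

```c
#include <stdatomic.h>
#include <stddef.h>

#define SLOT_COUNT 4   /* hypothetical pool size */

/* 0 = free, 1 = in use; static storage starts out zeroed. */
static atomic_int slot_busy[SLOT_COUNT];

/* Try to claim a free slot.
   Returns its index, or -1 if every slot is currently in use. */
int claim_slot(void)
{
    for (int s = 0; s < SLOT_COUNT; ++s) {
        int expected = 0;
        /* Succeeds only if the slot was still free (0 -> 1). */
        if (atomic_compare_exchange_strong(&slot_busy[s], &expected, 1))
            return s;
    }
    return -1;
}

/* Give the slot back so another group can reuse its buffer. */
void release_slot(int s)
{
    atomic_store(&slot_busy[s], 0);
}

/* Byte offset of a slot's buffer inside one large global allocation. */
size_t slot_offset(int s, size_t bytes_per_buffer)
{
    return (size_t)s * bytes_per_buffer;
}
```

With this layout the total allocation is SLOT_COUNT times the buffer size, independent of how many work-groups the kernel is launched with.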
I wrote this small kernel to determine the maximum number of concurrently running groups:
const char* OpenCLSource[] = {
"\
__kernel void Test(volatile __global int* count,\n\
volatile __global int* a)\n\
{\n\
volatile int i;\n\
barrier(CLK_GLOBAL_MEM_FENCE);\n\
i = atomic_inc(count);\n\
barrier(CLK_GLOBAL_MEM_FENCE);\n\
\n\
a[get_global_id(0)] = i;\n\
\n\
barrier(CLK_GLOBAL_MEM_FENCE);\n\
i = atomic_dec(count);\n\
barrier(CLK_GLOBAL_MEM_FENCE);\n\
\n\
}\n"
};
But after running this kernel with a work_dim of 1, a global_work_size of {100} and a local_work_size of {1}, the values in the array ‘a’ are much bigger than CL_DEVICE_MAX_COMPUTE_UNITS.
Why can this happen, although each group must decrement the count value before it can finish, and hence before the next group can be executed on the same compute unit?
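To make the expectation concrete, here is a minimal host-side C model of the counter, under a strong simplifying assumption: groups run in lockstep waves of `resident` groups, every group in a wave increments the counter (recording the old value, as `atomic_inc` returns it), and only then does every group in the wave decrement it. `max_recorded_value` is a hypothetical name, and real hardware scheduling is not this regular.

```c
/* Model of the counting experiment under the lockstep-wave assumption.
   groups   : total number of work-groups launched
   resident : how many groups are assumed to run at the same time
   Returns the largest value any group would record, or -1 on bad input. */
int max_recorded_value(int groups, int resident)
{
    int count = 0;
    int max_seen = 0;

    if (groups < 0 || resident <= 0)
        return -1;

    for (int first = 0; first < groups; first += resident) {
        int left = groups - first;
        int in_wave = left < resident ? left : resident;

        /* Every group in the wave increments and records the old value. */
        for (int g = 0; g < in_wave; ++g) {
            int old = count++;
            if (old > max_seen)
                max_seen = old;
        }
        /* Every group in the wave decrements before it finishes. */
        for (int g = 0; g < in_wave; ++g)
            count--;
    }
    return max_seen;
}
```

Under this model the largest recorded value is one less than the number of resident groups, so larger values in ‘a’ would mean that either more groups are in flight than assumed or the counter is not returning to zero between groups.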
I know these are very long questions, but I know nobody who can explain this to me, and I don’t know where the error in my reasoning is…
I hope one of you can help me!
Best regards,
Lukas