Optimal number of work-groups and work-items for GPU

Hi guys,

I need to know the optimal number of work-items I can use in one command. Let’s say the optimal number is 1 million work-items and my user specifies that he/she wants 2 million computations; then I will batch them up into 2 commands with 1 million work-items each.

Does this optimal number have to do with the amount of global memory the discrete GPU has? For example, if each work-item uses 10 bytes of global memory and the GPU has 1 GB of global memory, is the optimal number of work-items 100 million?

Please advise.

I am sorry if these sound like newbie questions.

While you are limited by the amount of global memory, there are other factors as well. Search the OpenCL reference for
CL_KERNEL_WORK_GROUP_SIZE
CL_KERNEL_COMPILE_WORK_GROUP_SIZE
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
Basically you gather that information and compute the local work size, which should be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
The global work size is a multiple of that local work size.
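Since the global work size must be a multiple of the local work size, the host just rounds the item count up. A minimal C sketch of that rounding (the helper name is mine, not from any OpenCL API):

```c
#include <stddef.h>

/* Round the requested item count up to the next multiple of the
 * local work size, so the resulting global work size is evenly
 * divisible by the local work size as OpenCL requires. */
size_t round_up_global(size_t items, size_t local_size)
{
    size_t remainder = items % local_size;
    return remainder == 0 ? items : items + (local_size - remainder);
}
```

With the example below, `round_up_global(1000, 256)` yields 1024.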
Let’s say you have 1000 items your user wants to process.
You have a CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE of 32.
Your local work size could then be 2 × 4 × 32 = 256 items in a work-group (i.e. 32 × 8), depending on what CL_KERNEL_WORK_GROUP_SIZE gives you.
Your optimal global work size would then be 4× the local work size = 1024.
But 1024 is too many, you say?
You just skip the last 24 items in the kernel; using the device’s PREFERRED size and masking out the overlapping items is still much faster than ignoring the MULTIPLE your device likes.
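Inside the kernel that skip is just an early-exit bounds check (in OpenCL C: `if (get_global_id(0) >= n) return;`). A plain-C simulation of the same logic, with illustrative names:

```c
#include <stddef.h>

/* Mimics the kernel-side guard: each of the 1024 launched work-items
 * checks its global id and bails out if it is past the real item
 * count, so the padding items do no work and no out-of-bounds
 * accesses happen. */
static int processed[1024];

static void fake_kernel(size_t gid, size_t n)
{
    if (gid >= n)           /* the last 24 of 1024 items return here */
        return;
    processed[gid] = 1;     /* the real computation would go here */
}
```

Launching it for all 1024 global ids with n = 1000 touches exactly the first 1000 slots.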
Then you still have to make sure that global memory is enough for the 1000 items. If it isn’t, partition the whole batch into parts that fit into global memory (e.g. two runs with 500 items each and a global work size of 512).
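That partitioning is a simple ceiling division on the host; a sketch, assuming you have already derived a per-run limit from something like CL_DEVICE_GLOBAL_MEM_SIZE divided by the bytes each item needs (the function name is mine):

```c
#include <stddef.h>

/* Split 'items' into runs of at most 'max_items_per_run', where the
 * limit would come from the device's global memory budget. Returns
 * how many enqueues are needed; each run still rounds its own item
 * count up to the local work size before launch. */
size_t num_batches(size_t items, size_t max_items_per_run)
{
    return (items + max_items_per_run - 1) / max_items_per_run;
}
```

For the example above, `num_batches(1000, 500)` gives the two runs of 500 items each.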