I hope this is the right forum to post a newbie question.
I'm using an Nvidia ION platform and I get the following:
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
After compiling the kernel successfully I run
err = clGetKernelWorkGroupInfo(ckKernel, cdDevice, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &szMaxLocalWorkSize, NULL);
and got a szMaxLocalWorkSize of 512 work-items.
Since each work-item of my kernel requires 49 bytes of local memory, 16 KB / 49 bytes gives 334 work-items. Based on that I launched 334 work-items, and that worked really well.
However, since the device has 2 compute units, each with 16 KBytes of local memory and a 512-work-item limit, I figured I should be able to launch 2 sets of 334 work-items. When I did that, my code bombed.
I wonder... is there something wrong with my thinking? When it says CL_DEVICE_LOCAL_MEM_SIZE is 16K, does that mean 16K of local memory for each compute unit, or 16K combined across all compute units? Can I have 2 different work-groups running, one on each of the compute units? (i.e., do the 2 compute units have to be instruction-synchronized?)
In my host code I have 2 threads, one to control each of the 2 compute units. Each thread has its own kernel object (compiled from the same source), its own queue, and its own global buffers. I can't find anything in the API for fine-grained control over which kernel runs on which compute unit.