When a multiprocessor executes several workgroups at once, its registers are shared among them. So with a workgroup size of 256, 8 registers per workitem, and 16384 registers in total, each workgroup needs 256 × 8 = 2048 registers, and 8 workgroups can be processed simultaneously. Is this also limited by local memory size (OpenCL local, CUDA shared)? If I use all available local memory in each workgroup, does that limit the multiprocessor to a single workgroup at a time? The Best Practices Guide states that shared memory can “act as a constraint on occupancy”, but its example deals with a different problem.
Thank you for your insights.
Thanks again. As I wrote above, the documentation (I reread the section you mentioned) still speaks only about registers, not about shared memory, although the same reasoning can be extrapolated to it. I just wanted to be sure.
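
To make the extrapolation concrete, here is a rough sketch of how I understand the limit, written as plain Python arithmetic rather than anything from the documentation. The function name `resident_workgroups` is my own, the 16 KB of local memory per multiprocessor and the cap of 8 resident workgroups are assumed values for illustration, and allocation granularity is ignored, so real hardware may round these numbers down further.

```python
def resident_workgroups(regs_per_sm, regs_per_item, items_per_group,
                        local_mem_per_sm, local_mem_per_group,
                        max_groups_per_sm):
    """Estimate how many workgroups fit on one multiprocessor at once.

    Simplified model: ignores register/local-memory allocation
    granularity and the per-multiprocessor workitem limit.
    """
    # Limit imposed by the shared register file.
    by_regs = regs_per_sm // (regs_per_item * items_per_group)

    # Limit imposed by local (CUDA: shared) memory.
    if local_mem_per_group > 0:
        by_local = local_mem_per_sm // local_mem_per_group
    else:
        by_local = max_groups_per_sm  # no local memory used, no limit from it

    # The tightest constraint wins.
    return min(by_regs, by_local, max_groups_per_sm)


# Numbers from the question; 16384 bytes of local memory per
# multiprocessor and a cap of 8 workgroups are ASSUMED values.
REGS, REGS_PER_ITEM, GROUP_SIZE = 16384, 8, 256
LOCAL_MEM, MAX_GROUPS = 16384, 8

# 2 KB of local memory per workgroup: registers and local memory both allow 8.
print(resident_workgroups(REGS, REGS_PER_ITEM, GROUP_SIZE, LOCAL_MEM, 2048, MAX_GROUPS))   # 8

# 4 KB per workgroup: local memory becomes the bottleneck.
print(resident_workgroups(REGS, REGS_PER_ITEM, GROUP_SIZE, LOCAL_MEM, 4096, MAX_GROUPS))   # 4

# All local memory in one workgroup: only a single workgroup is resident.
print(resident_workgroups(REGS, REGS_PER_ITEM, GROUP_SIZE, LOCAL_MEM, 16384, MAX_GROUPS))  # 1
```

Under this model, a workgroup that claims all of the local memory leaves room for only one resident workgroup, regardless of how many the register file alone would allow.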