How to group threads from different blocks with cooperative groups


I understand that cooperative groups let us create thread groups explicitly and synchronize across the whole grid. My objective is to launch 20 blocks of 1024 threads each. Because a block cannot exceed 1024 threads (the per-block maximum), I would like to spread the threads across several blocks and call thread_group.sync() instead of __syncthreads(). For example, to treat 2048 threads as one group, I should be able to combine blockIdx.x={0,1}, blockIdx.x={1,2}, and so on.

Can anyone explain how to do this? I have read many of the examples in the NVIDIA developer manuals but couldn't pinpoint the exact details or a matching example.

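This is not currently possible, with one exception.

The only thread group larger than a thread block is the grid group, which spans every thread in the grid. You cannot build a group out of an arbitrary subset of blocks, such as two neighbouring blocks.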

This could change in the future.
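
For reference, here is a minimal sketch of the one case that does work: synchronizing all 20 blocks of 1024 threads through the grid group. The kernel name neighbour_exchange and the buffers block_vals and out are hypothetical, chosen only for illustration. The sketch assumes a device that supports cooperative launch and can keep all 20 blocks resident at once. It checks both conditions and exits otherwise. Depending on the toolkit version, compiling may also require relocatable device code (nvcc -rdc=true).

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: each block publishes one value, the whole grid
// synchronizes, then each block reads the value written by the next block.
__global__ void neighbour_exchange(int *block_vals, int *out)
{
    cg::grid_group grid = cg::this_grid();

    if (threadIdx.x == 0)
        block_vals[blockIdx.x] = blockIdx.x * 10;

    // Barrier across every thread of every block in the grid.
    // __syncthreads() would only synchronize within one block.
    grid.sync();

    if (threadIdx.x == 0) {
        unsigned next = (blockIdx.x + 1) % gridDim.x;
        out[blockIdx.x] = block_vals[next];
    }
}

int main()
{
    const int blocks = 20, threads = 1024;
    int dev = 0, coop = 0, num_sms = 0, per_sm = 0;

    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, dev);
    if (!coop) { printf("cooperative launch not supported\n"); return 0; }

    // grid.sync() requires all blocks to be resident at the same time.
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&per_sm, neighbour_exchange,
                                                  threads, 0);
    if (per_sm * num_sms < blocks) {
        printf("device cannot hold %d blocks at once\n", blocks);
        return 0;
    }

    int *block_vals = nullptr, *out = nullptr;
    cudaMalloc(&block_vals, blocks * sizeof(int));
    cudaMalloc(&out, blocks * sizeof(int));

    // A kernel that calls grid.sync() must be launched cooperatively,
    // not with the <<<...>>> syntax.
    void *args[] = { &block_vals, &out };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void *)neighbour_exchange, dim3(blocks), dim3(threads), args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int host[blocks];
    cudaMemcpy(host, out, sizeof(host), cudaMemcpyDeviceToHost);

    // Every block should see the value its neighbour wrote before the barrier.
    int bad = 0;
    for (int b = 0; b < blocks; ++b)
        if (host[b] != ((b + 1) % blocks) * 10) ++bad;
    printf("%d mismatches\n", bad);

    cudaFree(block_vals);
    cudaFree(out);
    return bad ? 1 : 0;
}
```

Note that this synchronizes all 2048 x 10 threads together. It does not give you independent 2048-thread groups made of block pairs, which is what the question asks for and what is not available.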