I read this blog post: https://developer.nvidia.com/blog/cooperative-groups/
But I still cannot understand why cooperative groups exist. We still have 32 threads working together because of the physical warp limitation, so even if we pseudo-split the warp into 16 and 16, the other 16 threads have to wait, don't they?
You can pseudo-split a warp into two or more sub-groups (e.g. “tiles”), and each tile can work independently on its own data at the same time. The other 16 threads do not have to wait.
For example, tile 0 can compute a shuffle operation over its 16 threads, and tile 1 can compute a shuffle operation over its 16 threads, at the same time.
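Here is a minimal sketch of that two-tile case, assuming a single block of 32 threads. The kernel name `tile_sum_kernel` and the host helper `run_tile_sum` are made up for illustration; the cooperative groups calls (`tiled_partition<16>`, `shfl_down`, `thread_rank`, `size`) come from `cooperative_groups.h`.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Each 16-thread tile sums its own 16 input values using shuffles.
__global__ void tile_sum_kernel(const int* in, int* out)
{
    cg::thread_block block = cg::this_thread_block();
    // Split the block into tiles of 16 threads (two tiles per warp).
    cg::thread_block_tile<16> tile = cg::tiled_partition<16>(block);

    int val = in[block.thread_rank()];

    // Tree reduction: shfl_down only exchanges data inside this tile,
    // so tile 0 and tile 1 never see each other's values.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2) {
        val += tile.shfl_down(val, offset);
    }

    // Thread 0 of each tile holds that tile's sum.
    if (tile.thread_rank() == 0) {
        out[block.thread_rank() / tile.size()] = val;
    }
}

// Hypothetical host helper: launches one 32-thread block on the inputs
// 0..31 and copies the two tile sums back. Returns false on a CUDA error.
bool run_tile_sum(int sums[2])
{
    int h_in[32];
    for (int i = 0; i < 32; ++i) h_in[i] = i;

    int* d_in = nullptr;
    int* d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, 2 * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    tile_sum_kernel<<<1, 32>>>(d_in, d_out);

    // This copy waits for the kernel and also reports any launch error.
    cudaError_t err = cudaMemcpy(sums, d_out, 2 * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

With the inputs 0..31, calling `run_tile_sum(sums)` should leave 120 in `sums[0]` (0 + ... + 15) and 376 in `sums[1]` (16 + ... + 31). Both tiles run the same shuffle instructions together; the tile only limits which lanes exchange data.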
Yes, you can do the same thing without cooperative groups. The code looks a little less elegant.
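A minimal sketch of the version without cooperative groups, assuming one 32-thread block and a per-16-thread sum as the operation. The names `segment_sum_kernel` and `run_segment_sum` are hypothetical; the only CUDA feature it relies on is the `width` argument of `__shfl_down_sync`, which splits the warp into independent segments.

```cuda
#include <cuda_runtime.h>

// Each group of 16 lanes sums its own 16 input values, no cooperative groups.
__global__ void segment_sum_kernel(const int* in, int* out)
{
    const int kWidth = 16;  // hand-managed "tile" size
    int val = in[threadIdx.x];

    // width = 16 makes the shuffle treat lanes 0-15 and 16-31 as two
    // separate segments; the full mask says all 32 lanes take part.
    for (int offset = kWidth / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffffu, val, offset, kWidth);
    }

    // The first lane of each segment holds that segment's sum.
    if (threadIdx.x % kWidth == 0) {
        out[threadIdx.x / kWidth] = val;
    }
}

// Hypothetical host helper: launches one 32-thread block on the inputs
// 0..31 and copies the two sums back. Returns false on a CUDA error.
bool run_segment_sum(int sums[2])
{
    int h_in[32];
    for (int i = 0; i < 32; ++i) h_in[i] = i;

    int* d_in = nullptr;
    int* d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, 2 * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    segment_sum_kernel<<<1, 32>>>(d_in, d_out);

    // This copy waits for the kernel and also reports any launch error.
    cudaError_t err = cudaMemcpy(sums, d_out, 2 * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

The index arithmetic (`% kWidth`, `/ kWidth`) and the mask are now your job to keep consistent, which is the "less elegant" part.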
Good luck trying to do this as elegantly without cooperative groups.