What is the use of cooperative groups?

I read this blog post: https://developer.nvidia.com/blog/cooperative-groups/
But I still cannot understand why cooperative groups exist. The 32 threads of a warp still work together because of the physical warp limitation, so even if we pseudo-split the warp into 16 and 16, doesn't the other half have to wait?

You can pseudo-split into two or more sub-groups (e.g. “tiles”) and each tile can do something different at the same time.

For example, tile 0 can perform a shuffle operation across its 16 threads while tile 1 performs a shuffle operation across its own 16 threads, at the same time.
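Here is a minimal sketch of that idea, assuming a single block of 32 threads so that the warp splits into exactly two 16-thread tiles. The kernel name `tile_sum_kernel` and the host code are made up for illustration; `tiled_partition` and `shfl_down` are the cooperative groups calls. Each tile reduces its own 16 values with shuffles, and the two tiles never exchange data with each other.

```cuda
#include <cstdio>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: each 16-thread tile sums its own 16 inputs.
__global__ void tile_sum_kernel(const int *in, int *out)
{
    cg::thread_block block = cg::this_thread_block();
    // Split the block into tiles of 16 threads (two tiles per warp).
    cg::thread_block_tile<16> tile = cg::tiled_partition<16>(block);

    int val = in[block.thread_rank()];

    // Tree reduction inside the tile. shfl_down only moves data between
    // the 16 threads of this tile, so both tiles reduce independently.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2) {
        val += tile.shfl_down(val, offset);
    }

    // Thread 0 of each tile holds that tile's sum.
    if (tile.thread_rank() == 0) {
        out[block.thread_rank() / 16] = val;
    }
}

int main()
{
    const int n = 32;                 // one warp, assumed for this sketch
    int h_in[n];
    int h_out[2] = {0, 0};
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, 2 * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    tile_sum_kernel<<<1, n>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, 2 * sizeof(int), cudaMemcpyDeviceToHost);

    // Prints: tile 0 sum = 120, tile 1 sum = 376
    printf("tile 0 sum = %d, tile 1 sum = %d\n", h_out[0], h_out[1]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Both tiles execute the same shuffle instruction in the same warp, each on its own 16 lanes, so neither half waits for the other.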

Yes, you can do the same thing without cooperative groups. The code looks a little less elegant.
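For comparison, a hand-written version might look roughly like the sketch below. It assumes one block of 32 threads and uses the `width` argument of `__shfl_down_sync` to confine each shuffle to a 16-lane segment; the kernel name `half_warp_sum_kernel` and the host code are made up for illustration. The tile size, the lane arithmetic, and the participation mask all have to be spelled out by hand.

```cuda
#include <cstdio>

// Hypothetical kernel: sums each 16-lane half of a warp without
// cooperative groups.
__global__ void half_warp_sum_kernel(const int *in, int *out)
{
    const int width = 16;             // size of each pseudo-split segment
    int val = in[threadIdx.x];

    // width = 16 makes the shuffle treat each warp as two independent
    // 16-lane segments. The full mask says all 32 lanes participate.
    for (int offset = width / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffffu, val, offset, width);
    }

    // The first lane of each segment holds that segment's sum.
    if (threadIdx.x % width == 0) {
        out[threadIdx.x / width] = val;
    }
}

int main()
{
    const int n = 32;                 // one warp, assumed for this sketch
    int h_in[n];
    int h_out[2] = {0, 0};
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, 2 * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    half_warp_sum_kernel<<<1, n>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, 2 * sizeof(int), cudaMemcpyDeviceToHost);

    // Prints: half 0 sum = 120, half 1 sum = 376
    printf("half 0 sum = %d, half 1 sum = %d\n", h_out[0], h_out[1]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```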

Good luck trying to do this as elegantly, without cooperative groups.