What is the use of cooperative groups?

I saw this blog post: https://developer.nvidia.com/blog/cooperative-groups/
But I still cannot understand why cooperative groups exist. We still have 32 threads working together, because of the physical warp limitation. So even if we pseudo-split a warp into 16 and 16, the other 16 threads have to wait, don't they?

You can pseudo-split into two or more sub-groups (e.g. “tiles”) and each tile can do something different at the same time.

For example, tile 0 can compute a shuffle operation over its 16 threads, and tile 1 can compute a shuffle operation over its 16 threads, at the same time.
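To make that concrete, here is a minimal sketch of two 16-thread tiles each running a shuffle-based sum over its own threads. The kernel name `tile_sum`, the output layout, and the single 32-thread block are my own assumptions for illustration, not something taken from the blog post.

```cuda
#include <cstdio>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: each 16-thread tile sums the thread indices of its
// own members using tile-local shuffles.
__global__ void tile_sum(int *out)
{
    cg::thread_block block = cg::this_thread_block();
    // Split the block into 16-thread tiles, i.e. two tiles per 32-thread warp.
    cg::thread_block_tile<16> tile = cg::tiled_partition<16>(block);

    int val = threadIdx.x;  // value contributed by this thread

    // Tree reduction; shfl_down only exchanges data inside the calling tile.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        val += tile.shfl_down(val, offset);

    // Rank 0 of each tile now holds that tile's sum.
    if (tile.thread_rank() == 0)
        out[threadIdx.x / 16] = val;
}

int main()
{
    int *d_out = nullptr;
    int h_out[2] = {0, 0};
    cudaMalloc(&d_out, sizeof(h_out));
    tile_sum<<<1, 32>>>(d_out);  // one warp, split into two tiles
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    // Expected: tile 0 sum = 120 (0+...+15), tile 1 sum = 376 (16+...+31)
    printf("tile 0 sum = %d, tile 1 sum = %d\n", h_out[0], h_out[1]);
    cudaFree(d_out);
    return 0;
}
```

Both tiles execute the same shuffle instructions, so this is the case where the work for both sub-groups can be issued together.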

Yes, you can do the same thing without cooperative groups. The code looks a little less elegant.

Good luck trying to do this as elegantly, without cooperative groups.

Wait, you mean we can get around the physical constraint? We can split a warp? So, for example, the first 8 threads do task 1 and, at the same time, the last 8 threads do a different task 2?

Thank you!!

Yes, the Volta execution model makes that possible. We have to be somewhat careful what we mean by “at the same time”, but nevertheless the cooperative groups examples are pretty self-evident, in my view. If task 1 and task 2 are similar, then “at the same time” can have a particular meaning. If task 1 and task 2 are completely different, then “at the same time” will end up meaning something a little bit different.

I gave a specific example already: tile 0 and tile 1 each computing a shuffle operation over their own 16 threads.

In that case “at the same time” can mean “in the same clock cycle, work is being done in both sub-groups”.

Wait a minute, maybe there is a typo in your example? You said tiles 0 and 1 can do different tasks at the same time, but in the example both of them are “computing a shuffle”.

Maybe your example could be:
tile 0 (the first 16 threads of a warp) does a shuffle while tile 1 (the last 16 threads of the warp) reads memory (or does some other computation).

By the way, given that warps can overlap each other, I am not sure how to verify the conclusion that cooperative groups can split a warp…

Thanks!!!

Maybe I can do an experiment like this:

Tile 0 has 16 threads and tile 1 has 16 threads. Tile 0 gets shorter tasks and tile 1 gets longer tasks (double the workload, same kind of task). If the two tiles are not bound to each other, tile 0 can fetch a new short task as soon as it finishes, without waiting for tile 1.

The comparison case is that both tiles get the longer tasks.

If experiment 1 is noticeably faster than experiment 2 (maybe 50-60%), then cooperative groups can split a warp.

Will this work?

If you have the same work to do (e.g. a shuffle) amongst tiles, then it’s possible that work could be issued in the same cycle (to all tiles).

If you have different work to do (different instructions) it’s not possible to issue that work to all tiles, in the same cycle. You can still write code that is tile specific, but the tile processing will not take place in the same cycles, across tiles.
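As a rough sketch of what “tile-specific code” could look like, the kernel below branches on the tile index so that tile 0 computes a sum and tile 1 computes a maximum. The kernel name `tile_specific` and the choice of the two tasks are hypothetical, and I am assuming a Volta-or-newer GPU. Since the two branches contain different instructions, they diverge within the warp and are not issued in the same cycle, but each tile still produces its own correct result.

```cuda
#include <cstdio>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: the two 16-thread tiles of one warp run different code.
__global__ void tile_specific(int *out)
{
    cg::thread_block block = cg::this_thread_block();
    // Collective call: every thread of the block creates its tile here,
    // before any tile-specific branching.
    cg::thread_block_tile<16> tile = cg::tiled_partition<16>(block);

    const int tile_id = threadIdx.x / 16;  // 0 or 1 for a 32-thread block
    int val = threadIdx.x;

    if (tile_id == 0) {
        // Tile 0: sum of its 16 values.
        for (int offset = tile.size() / 2; offset > 0; offset /= 2)
            val += tile.shfl_down(val, offset);
    } else {
        // Tile 1: maximum of its 16 values (different instructions).
        for (int offset = tile.size() / 2; offset > 0; offset /= 2)
            val = max(val, tile.shfl_down(val, offset));
    }

    // Rank 0 of each tile holds that tile's result.
    if (tile.thread_rank() == 0)
        out[tile_id] = val;
}

int main()
{
    int *d_out = nullptr;
    int h_out[2] = {0, 0};
    cudaMalloc(&d_out, sizeof(h_out));
    tile_specific<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    // Expected: tile 0 sum = 120 (0+...+15), tile 1 max = 31
    printf("tile 0 sum = %d, tile 1 max = %d\n", h_out[0], h_out[1]);
    cudaFree(d_out);
    return 0;
}
```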

Well… thank you very much for your explanation, although I still have some confusion…

So you mean that within a warp, every thread still needs to do the same task at every specific time step, even when using cooperative groups?

Also, maybe I can have 2 warps as one cooperative group and another warp by itself as a second group, so that the two groups could do different things at every specific time step?

Thank you!!!

Cooperative groups do not add new functionality or change hardware scheduling.
Effectively, they are just a wrapper that simplifies programming (for example, no need to calculate lane masks manually).
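To illustrate the “wrapper” point, here is a rough sketch of a half-warp shuffle sum (each group of 16 threads reducing over its own values) written without cooperative groups, where the lane mask and shuffle width are worked out by hand. The kernel name `tile_sum_manual` and the mask constants are my own illustration, assuming a single 32-thread block.

```cuda
#include <cstdio>

// Hypothetical kernel: the same per-16-thread sum a thread_block_tile<16>
// would give, but with the lane bookkeeping done manually.
__global__ void tile_sum_manual(int *out)
{
    const unsigned lane     = threadIdx.x % 32;  // lane within the warp
    const unsigned sub_lane = lane % 16;         // rank within the half-warp

    // Mask naming the 16 lanes of this thread's half-warp.
    const unsigned mask = (lane < 16) ? 0x0000FFFFu : 0xFFFF0000u;

    int val = threadIdx.x;

    // width = 16 makes the shuffle treat each half-warp as its own segment.
    for (int offset = 8; offset > 0; offset /= 2)
        val += __shfl_down_sync(mask, val, offset, 16);

    // The first lane of each half-warp holds that half's sum.
    if (sub_lane == 0)
        out[threadIdx.x / 16] = val;
}

int main()
{
    int *d_out = nullptr;
    int h_out[2] = {0, 0};
    cudaMalloc(&d_out, sizeof(h_out));
    tile_sum_manual<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    // Expected: half 0 sum = 120 (0+...+15), half 1 sum = 376 (16+...+31)
    printf("half 0 sum = %d, half 1 sum = %d\n", h_out[0], h_out[1]);
    cudaFree(d_out);
    return 0;
}
```

With cooperative groups, the mask, the width argument, and the rank arithmetic are hidden behind `tiled_partition<16>`, `tile.shfl_down()` and `tile.thread_rank()`.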

Recent CUDA versions allow thread block tiles that span multiple warps, with restricted functionality compared to single-warp groups. See the CUDA C Programming Guide.

Yes. This is the nature of GPU instruction scheduling, and the Volta execution model does not change this. In any given instruction cycle, at most one instruction can be issued for a warp, regardless of the state of the warp. (For the sake of this discussion, I am ignoring any considerations of dual-issue capable warp schedulers. AFAIK, no recent GPU architecture has dual-issue capable warp schedulers.)

Yes. CG does not change that.

Correct. With or without CG, as soon as you go beyond a single warp, then you have the possibility for one warp to be issued a particular instruction, while another warp is issued another instruction, in the same clock cycle.
