Cooperative Groups: Flexible CUDA Thread Programming

Originally published at: Cooperative Groups: Flexible CUDA Thread Programming | NVIDIA Technical Blog

In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize. The granularity of sharing varies from algorithm to algorithm, so thread synchronization should be flexible. Making synchronization an explicit part of the program ensures safety, maintainability, and modularity. CUDA 9 introduces Cooperative Groups, which aims to…

Hello, is this available on Volta GPUs only?

No, everything in this post is supported on Kepler and later GPUs. I will update the post to make that clear. There are features mentioned in the conclusion and the programming guide that require Pascal and later GPUs: specifically, those are multi-block synchronization and multi-GPU synchronization.
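For example, here is a minimal sketch of how you could check at runtime whether a device supports those Pascal-and-later features, using the CUDA runtime's standard attribute queries (the specific program is my illustration, not from the post):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0;
    int coopLaunch = 0, coopMultiDev = 0;
    // Multi-block (grid-wide) synchronization requires cooperative launch support.
    cudaDeviceGetAttribute(&coopLaunch, cudaDevAttrCooperativeLaunch, dev);
    // Multi-GPU synchronization requires cooperative multi-device launch support.
    cudaDeviceGetAttribute(&coopMultiDev, cudaDevAttrCooperativeMultiDeviceLaunch, dev);
    printf("cooperative launch: %d, multi-device launch: %d\n", coopLaunch, coopMultiDev);
    return 0;
}
```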

Hello,
Thank you for this great article; it gives coders new ways to write more readable code.

I can see a typo in the *thread_sum* function: the variable "i" is declared twice.

During the presentation, if I understood correctly, you said this is a safe way to synchronize a grid or GPUs, and that the price to pay is that registers, local memory, etc. are cleared. So why not use a custom (hand-coded) grid synchronization? In that case, no register/local memory refresh is needed.

No, that's not the case. Cooperative Groups inter-block synchronization will *not* invalidate registers/local memory/shared memory. In the past, the only supported way to synchronize across blocks was to exit the kernel and launch another -- that definitely would invalidate registers/local memory/shared memory!
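To illustrate (a sketch of my own, not code from the post): with a cooperative launch, a value held in a register survives the grid-wide barrier across iterations.

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void iterate(float *data, int n, int iters) {
    cg::grid_group grid = cg::this_grid();
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float local = (idx < n) ? data[idx] : 0.0f; // lives in a register across syncs
    for (int it = 0; it < iters; it++) {
        local = 0.5f * local + 1.0f; // placeholder per-element update
        grid.sync();                 // grid-wide barrier; no state is invalidated
    }
    if (idx < n) data[idx] = local;
}
```

On the host side this kernel must be launched with cudaLaunchCooperativeKernel, and as noted above, grid-wide sync requires Pascal or later.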

Nice, I probably misunderstood. Today I am using a custom grid sync in order not to lose registers. I am going to check the performance of the new sync. Thanks.

Nice post, thanks. When will the next post be published? I'm ready for it!

`cudaMallocManaged(data, n * sizeof(int));` <-- should be `&data`

Thanks, fixed.
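For anyone copying from the comments, the corrected call passes the address of the pointer so the allocator can write into it (variable names taken from the snippet quoted above):

```
int *data;
cudaMallocManaged(&data, n * sizeof(int)); // note &data: the API fills in the pointer
```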

In
```
thread_group tile32 = cg::partition(this_thread_block(), 32);
```
I don't think there's a `cg::partition`, but there is a `cg::tiled_partition` and it's probably meant to be the latter. I only see `cooperative_groups::tiled_partition` in the CUDA Toolkit documentation, Sec. C.

Otherwise, I get this error: `error: namespace "cooperative_groups" has no member "partition"` (on CUDA 9 with a GeForce GTX 980 Ti, so -arch='sm_52'; by the way, any hardware donation of a Titan V or GTX 1080 Ti would be welcome!).

Fixed. Thanks!
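For anyone landing here later, the corrected line uses `tiled_partition`. A sketch of both forms (the templated form requires the tile size at compile time):

```
namespace cg = cooperative_groups;
cg::thread_block block = cg::this_thread_block();

// Runtime tile size: returns a generic thread_group.
cg::thread_group tile32 = cg::tiled_partition(block, 32);

// Compile-time tile size: returns a thread_block_tile<32>, which also
// exposes warp-level collectives such as shfl_down() and ballot().
cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);
```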

Hi, is there any special hardware (for example, a dedicated register) supporting the ballot function?

Excellent blog, thank you so much. As a minor observation, in `reduce_sum_tile_shfl`, `lane` seems unused.

Good article -- but I'm looking for the follow-up on multi-block synchronization. The user guide only talks about synchronizing the entire `grid_group`, but how do I synchronize a subset of blocks within a `grid_group`? For example, I want to synchronize threads in the "Z" dimension but not across all X, Y, Z blocks.

Hi David, synchronizing a subset of blocks is not currently supported. Currently there's no partitioning capability for `grid_group`.

Good catch! Fixed.
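For reference, with the unused variable removed the tile shuffle reduction has this general shape (a sketch consistent with the post, not the exact listing):

```
namespace cg = cooperative_groups;

template <int tile_sz>
__device__ int reduce_sum_tile_shfl(cg::thread_block_tile<tile_sz> g, int val) {
    // Each step halves the number of active values; shfl_down() exchanges
    // them directly through registers, with no shared memory involved.
    for (int i = g.size() / 2; i > 0; i /= 2) {
        val += g.shfl_down(val, i);
    }
    return val; // thread 0 of the tile holds the full sum
}
```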

It is part of the GPU instruction set. https://docs.nvidia.com/cud...
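In code, the ballot is exposed both as a warp intrinsic (`__ballot_sync`) and as a tile-group collective. A sketch (the predicate here is just a placeholder):

```
namespace cg = cooperative_groups;

__device__ int count_positive(cg::thread_block block, float x) {
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);
    // Each bit of the result corresponds to one lane's predicate.
    unsigned mask = tile.ballot(x > 0.0f);
    return __popc(mask); // number of lanes in this tile with x > 0
}
```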

Hello,

I don't understand the purpose of the second g.sync() in

```
temp[lane] = val;
g.sync(); // wait for all threads to store
if (lane < i) val += temp[lane + i];
g.sync(); // wait for all threads to load
```

Loads are done from the second half of temp, while stores are effectively done to the first half of temp (the second half of the vals is not updated because of "if (lane < i)"). Isn't the second g.sync() unnecessary?

Hi Igor, while technically your suggestion may work for this specific code, in general it's incorrect to remove one of the syncs. You would probably have to mark temp as volatile, which is a hack. The g.sync() calls prevent the compiler from performing code-motion optimizations across the synchronization points. Without them you have a race condition, even if the data involved in the race is beyond what the algorithm uses. As an example, if you changed this downward reduction to a so-called "butterfly" reduction (using xor rather than + for the indexing), both syncs would be absolutely required.
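To make the butterfly point concrete, here is a sketch of that variant (the shape is my assumption, not code from the post). Every lane both stores and loads in every round, so neither sync can be dropped:

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__device__ int reduce_butterfly(cg::thread_group g, int *temp, int val) {
    int lane = g.thread_rank();
    for (int i = g.size() / 2; i > 0; i /= 2) {
        temp[lane] = val;
        g.sync();               // every store must land before any lane loads
        val += temp[lane ^ i];  // xor indexing: all lanes read a partner's value
        g.sync();               // every load must finish before temp is overwritten
    }
    return val; // every lane ends up holding the full sum
}
```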