Cooperative Groups: Flexible CUDA Thread Programming

Originally published at: Cooperative Groups: Flexible CUDA Thread Programming | NVIDIA Technical Blog

In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize. The granularity of sharing varies from algorithm to algorithm, so thread synchronization should be flexible. Making synchronization an explicit part of the program ensures safety, maintainability, and modularity. CUDA 9 introduces Cooperative Groups, which aims to…

Hello, is this available on Volta GPUs only?

No, everything in this post is supported on Kepler and later GPUs. I will update the post to make that clear. There are features mentioned in the conclusion and the programming guide that require Pascal and later GPUs: specifically, those are multi-block synchronization and multi-GPU synchronization.
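For example, here is a minimal sketch of how you could check at runtime whether a device supports those Pascal-and-later features, using the CUDA runtime's standard attribute queries (the specific program is my illustration, not from the post):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0;
    int coopLaunch = 0, coopMultiDev = 0;
    // Multi-block (grid-wide) synchronization requires cooperative launch support.
    cudaDeviceGetAttribute(&coopLaunch, cudaDevAttrCooperativeLaunch, dev);
    // Multi-GPU synchronization requires cooperative multi-device launch support.
    cudaDeviceGetAttribute(&coopMultiDev, cudaDevAttrCooperativeMultiDeviceLaunch, dev);
    printf("cooperative launch: %d, multi-device launch: %d\n", coopLaunch, coopMultiDev);
    return 0;
}
```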

Hello,
Thank you for this great article; it gives coders new ways to write more readable code.

I can see a typo in the *thread_sum* function: the variable "i" is declared twice.

During the presentation, if I understood correctly, you said this is a safe way to synchronize a grid or GPUs, and that the price to pay is that registers, local memory, etc. are cleared. So why not use a custom (hand-coded) grid synchronization? In that case, no register/local memory refresh is needed.

No, that's not the case. Cooperative Groups inter-block synchronization will *not* invalidate registers/local memory/shared memory. In the past, the only supported way to synchronize across blocks was to exit the kernel and launch another -- that definitely would invalidate registers/local memory/shared memory!
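To illustrate (a sketch of my own, not code from the post): with a cooperative launch, a value held in a register survives the grid-wide barrier across iterations.

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void iterate(float *data, int n, int iters) {
    cg::grid_group grid = cg::this_grid();
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float local = (idx < n) ? data[idx] : 0.0f; // lives in a register across syncs
    for (int it = 0; it < iters; it++) {
        local = 0.5f * local + 1.0f; // placeholder per-element update
        grid.sync();                 // grid-wide barrier; no state is invalidated
    }
    if (idx < n) data[idx] = local;
}
```

On the host side this kernel must be launched with cudaLaunchCooperativeKernel, and as noted above, grid-wide sync requires Pascal or later.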

Nice, I probably misunderstood. Today I am using a custom grid sync in order not to lose registers. I am going to check the performance of the new sync. Thanks.

Nice post, thanks. When will the next post be published? I'm ready for it!

`cudaMallocManaged(data, n * sizeof(int));` <-- should be `&data`

Thanks, fixed.
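For anyone copying from the comments, the corrected call passes the address of the pointer so the allocator can write into it (variable names taken from the snippet quoted above):

```
int *data;
cudaMallocManaged(&data, n * sizeof(int)); // note &data: the API fills in the pointer
```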

In
```
thread_group tile32 = cg::partition(this_thread_block(), 32);
```
I don't think there's a `cg::partition`, but there is a `cg::tiled_partition` and it's probably meant to be the latter. I only see `cooperative_groups::tiled_partition` in the CUDA Toolkit documentation, Sec. C.

Otherwise, I get this error: `error: namespace "cooperative_groups" has no member "partition"` (on CUDA 9 with a GeForce GTX 980 Ti, so -arch='sm_52'; by the way, any hardware donation of a Titan V or GTX 1080 Ti would be welcome!).

Fixed. Thanks!
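For anyone landing here later, the corrected line uses `tiled_partition`. A sketch of both forms (the templated form requires the tile size at compile time):

```
namespace cg = cooperative_groups;
cg::thread_block block = cg::this_thread_block();

// Runtime tile size: returns a generic thread_group.
cg::thread_group tile32 = cg::tiled_partition(block, 32);

// Compile-time tile size: returns a thread_block_tile<32>, which also
// exposes warp-level collectives such as shfl_down() and ballot().
cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);
```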

Hi, is there any special hardware (for example, a dedicated register) supporting the ballot function?

Excellent blog, thank you so much. As a minor observation, in `reduce_sum_tile_shfl`, `lane` seems unused.

Good article -- but I'm looking for the follow-up on multi-block synchronization. The user guide only talks about synchronizing the entire `grid_group`, but how do I synchronize a subset of blocks within a `grid_group`? For example, I want to synchronize threads in the "Z" dimension but not across all X, Y, Z blocks.

Hi David, synchronizing a subset of blocks is not currently supported. Currently there's no partitioning capability for `grid_group`.

Good catch! Fixed.
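For reference, with the unused variable removed the tile shuffle reduction has this general shape (a sketch consistent with the post, not the exact listing):

```
namespace cg = cooperative_groups;

template <int tile_sz>
__device__ int reduce_sum_tile_shfl(cg::thread_block_tile<tile_sz> g, int val) {
    // Each step halves the number of active values; shfl_down() exchanges
    // them directly through registers, with no shared memory involved.
    for (int i = g.size() / 2; i > 0; i /= 2) {
        val += g.shfl_down(val, i);
    }
    return val; // thread 0 of the tile holds the full sum
}
```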

It is part of the GPU instruction set. https://docs.nvidia.com/cud...
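In code, the ballot is exposed both as a warp intrinsic (`__ballot_sync`) and as a tile-group collective. A sketch (the predicate here is just a placeholder):

```
namespace cg = cooperative_groups;

__device__ int count_positive(cg::thread_block block, float x) {
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);
    // Each bit of the result corresponds to one lane's predicate.
    unsigned mask = tile.ballot(x > 0.0f);
    return __popc(mask); // number of lanes in this tile with x > 0
}
```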

Hello,

I don't understand the purpose of the second g.sync() in

```
temp[lane] = val;
g.sync(); // wait for all threads to store
if (lane < i) val += temp[lane + i];
g.sync(); // wait for all threads to load
```

Loads are done from the second half of temp, while stores are effectively done to the first half of temp (the second half of the vals is not updated because of "if (lane < i)"). Isn't the second g.sync() unnecessary?

Hi Igor, while technically your suggestion may work for this specific code, in general it's incorrect to remove one of the syncs. You would probably have to mark temp as volatile, which is a hack. The g.sync() calls prevent the compiler from performing code-motion optimizations across the synchronization points. Without them you have a race condition, even if the data involved in the race is beyond what the algorithm uses. As an example, if you changed this downward reduction to a so-called "butterfly" reduction (using xor rather than + for the indexing), both syncs would be absolutely required.
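To make the butterfly point concrete, here is a sketch of that variant (the shape is my assumption, not code from the post). Every lane both stores and loads in every round, so neither sync can be dropped:

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__device__ int reduce_butterfly(cg::thread_group g, int *temp, int val) {
    int lane = g.thread_rank();
    for (int i = g.size() / 2; i > 0; i /= 2) {
        temp[lane] = val;
        g.sync();               // every store must land before any lane loads
        val += temp[lane ^ i];  // xor indexing: all lanes read a partner's value
        g.sync();               // every load must finish before temp is overwritten
    }
    return val; // every lane ends up holding the full sum
}
```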