Does the grid_sync in cooperative groups have the same functionality as the device-wide synchronization?

Hi,
I'm new to the cooperative groups mechanism, and I am wondering whether grid_sync in cooperative groups has the same functionality as a device-wide synchronization?

I mean, suppose I have two dependent kernel launches, kernel 1 and kernel 2.
Do modern GPUs already support device-wide synchronization without having to complete the kernel? In other words, can we fuse the two kernels into a single kernel by using grid_sync()? If so, could you please give a piece of sample code? Thank you very much.

The .sync() method in cooperative groups is a device-wide sync if it is (properly) applied to a grid group.

I wasn't aware of any such thing in CUDA until cooperative groups were introduced, in the CUDA 9 timeframe I believe. There isn't any other explicit device-wide sync in CUDA that I am aware of (barring the kernel launch boundary) other than the one provided by cooperative groups.

There are numerous questions and examples of proper usage of cooperative groups grid sync on various public forums as well as in CUDA sample codes.

Here is one such example on a public forum. You could imagine everything up to the grid.sync() line started out as kernel 1, and everything after that was kernel 2.

Here is a CUDA sample code that handles multiple reduction phases with a grid sync.
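For reference, here is a minimal sketch of the fusion pattern just described. It is not taken from either link, and the kernel, array, and parameter names (fusedKernel, in, tmp, out, N) are made up for illustration. Note that grid.sync() requires a cooperative launch with a grid small enough to be fully co-resident on the device.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void fusedKernel(const float *in, float *tmp, float *out, int N)
{
    cg::grid_group grid = cg::this_grid();
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // "kernel 1": first phase writes tmp
    if (idx < N) tmp[idx] = in[idx] * 2.0f;

    grid.sync();  // device-wide barrier: every thread in the grid waits here

    // "kernel 2": second phase can now safely read any element written above
    if (idx < N) out[idx] = tmp[idx] + tmp[(idx + 1) % N];
}

// Host side: a cooperative launch is required for grid.sync() to be legal.
void launchFused(const float *d_in, float *d_tmp, float *d_out, int N)
{
    int block = 256;
    int grid  = (N + block - 1) / block;  // must not exceed what can be co-resident on the device
    void *args[] = { &d_in, &d_tmp, &d_out, &N };
    cudaLaunchCooperativeKernel((void *)fusedKernel, grid, block, args, 0, 0);
}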

I see. How can we determine the size of each cooperative group? Let's say I want to work on an array of arbitrary size N and want each thread to work on one element of it.

How do I decide each cooperative group's size? Suppose I need to partition the array into many segments and want each cooperative group to work on one segment and perform a segmented scan. The segment sizes differ and are only known at runtime, when the kernel starts. Does CUDA support groups of different sizes determined at runtime?

Yes, perhaps you should read the CG documentation. Even the CUDA sample code I already linked demonstrates usage of groups of different sizes.

However, these groups are all "carved out" of the grid that you launch; in-kernel usage of cooperative groups does not change the size of the grid you launched. All groups are determined at runtime, and they are created mainly from either a static or dynamic subdivision of the threads in a block. At the grid level there is basically only one size, the grid, although with Hopper and threadblock clusters even that is starting to change.
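To illustrate what I mean by static or dynamic subdivision, here is a rough sketch (the kernel and variable names are just illustrative). Both forms carve sub-groups out of the block you launched, and neither changes the grid size:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void subdivideKernel()
{
    cg::thread_block cta = cg::this_thread_block();

    // Static subdivision: the tile size is a compile-time template parameter.
    cg::thread_block_tile<16> tile16 = cg::tiled_partition<16>(cta);

    // Dynamic subdivision: the tile size is a runtime argument (a power of 2
    // no larger than 32 for this non-templated form, as I understand it).
    cg::thread_group tile8 = cg::tiled_partition(cta, 8);

    tile16.sync();
    tile8.sync();
}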

So for "kernel fusion" of two dissimilarly sized grids, one thing you might want to consider is designing each kernel phase using a methodology like a grid-stride loop, to allow flexible use of the grid (the threads) independent of the problem size, as in the sketch below.
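A rough sketch of the grid-stride idea (the kernel and parameter names are made up): the same launched grid handles any problem size N, with each thread processing zero, one, or more elements, so the fused phases do not each need their own grid size.

__global__ void scaleKernel(float *data, int N, float factor)
{
    // Grid-stride loop: correct for any N, independent of the launch configuration.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < N;
         i += gridDim.x * blockDim.x)
    {
        data[i] *= factor;
    }
}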

Thank you!
However, I have actually read the CG documentation as well as the CUDA sample code, for example the one you shared with me.

I am still confused about a few points.

  1. For example, which lines set each group's size, and how can we set it manually ourselves?

  2. After we set each group's size, I think CUDA provides the cg::inclusive_scan() API to perform a scan at the tile level, but not at the level of an arbitrary cooperative group?

  3. It seems to me that both grid.sync() and cuda::barrier can serve as a device-wide sync?

And since cooperative groups were introduced in CUDA 9, does that mean kernels no longer need to be split into multiple kernels when we need global synchronization? I'm not sure whether my understanding is right. Or are there tradeoffs between using grid_sync and breaking the work into multiple kernel launches? Does the former always perform better than the latter?

This line creates a group that is the size of the grid. This line creates a group the size of a threadblock. This line takes a block-sized group that is handed to it and creates a sub-group that is the size of a warp. (See the sketch below.)
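Put together, those three constructions typically look roughly like this (the kernel and variable names are just illustrative):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void groupsKernel()
{
    cg::grid_group grid = cg::this_grid();           // group the size of the whole grid
    cg::thread_block cta = cg::this_thread_block();  // group the size of one threadblock
    cg::thread_block_tile<32> warp =
        cg::tiled_partition<32>(cta);                // warp-sized sub-group carved from the block

    warp.sync();   // each group can be synchronized independently
    cta.sync();
    grid.sync();   // requires a cooperative launch
}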

Cooperative groups don't have infinite flexibility; they essentially work within the hierarchy already established by CUDA. If you have a grid group that is 100,000 threads, for example, I don't know how to create two sub-groups of 50,000 threads each. Up to approximately warp level (i.e. 32 threads) you have considerable flexibility in determining group sizes, and session 9 in this online training series provides additional coverage of these ideas.

Correct, not everything is symmetrical in CG. I don’t think you can do a cg::inclusive_scan at the grid level, for example, although someone may point out that I am wrong.
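As a sketch of what does work, assuming CUDA 11.0 or newer where the <cooperative_groups/scan.h> header is available, an inclusive scan at the warp-tile level looks something like this (the kernel and array names are made up):

#include <cooperative_groups.h>
#include <cooperative_groups/scan.h>
namespace cg = cooperative_groups;

__global__ void tileScanKernel(const int *in, int *out, int N)
{
    cg::thread_block cta = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(cta);

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int val = (idx < N) ? in[idx] : 0;            // every tile thread takes part in the collective
    int scanned = cg::inclusive_scan(tile, val);  // prefix sum across the 32 threads of this tile only
    if (idx < N) out[idx] = scanned;
}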

Yes, there are additional mechanisms for synchronization in the cuda:: namespace as well as in libcu++. These are more recent additions to the CUDA arsenal, and I don’t always have everything in mind when answering questions here. So yes, correct, CG is not the only way to synchronize.

Thank you very much.

Therefore, we cannot create groups of arbitrary size by hand inside the kernel, I think; am I understanding correctly?

Also, we cannot perform cg::scan at every group level as a segmented prefix sum, am I correct?

Lastly, if I only want to fuse multiple small kernels (each one doing one step) into one kernel, then CG can achieve that with cg::grid_group grid = cg::this_grid();, right? By the way, do cg::sync(grid); and grid.sync() perform the same function?

There are some sizes you can create, and some sizes you cannot.

Not all cg methods are supported for all group sizes/types.

I’m not sure what you are asking. You could “fuse” kernels by declaring a grid group, then doing a grid group sync in between the code for successive kernels.

Yes, AFAIK.
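For what it's worth, here is a minimal illustration (the kernel name is hypothetical); as far as I know the two spellings express the same grid-wide barrier:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void syncFormsKernel()
{
    cg::grid_group grid = cg::this_grid();
    grid.sync();      // member-function form
    cg::sync(grid);   // free-function form of the same barrier
}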

Thank you very much!
The size should be the grid size, the threadblock size, the warp size, or a multiple of 32, those kinds of numbers I guess, right?

You can create a group of 4 threads. You can create a group of 8 threads. I haven't personally tabulated all the possible sizes.

You mean using cg::thread_block_tile<x> tile32 = cg::tiled_partition<x>(cta) to set a tile size that is less than 32?

It is all explained in the programming guide here: CUDA C++ Programming Guide.

The tiled_partition method is a collective operation that partitions the parent group into a one-dimensional, row-major tiling of subgroups. A total of (size(parent)/tilesz) subgroups will be created, therefore the parent group size must be evenly divisible by the Size. The allowed parent groups are thread_block or thread_block_tile.

Functionality is limited to native hardware sizes (1/2/4/8/16/32), and cg::size(parent) must be greater than the Size parameter. The templated version of tiled_partition supports sizes of 64/128/256/512 as well, but some additional steps are required on Compute Capability 7.5 or lower; refer to Thread Block Tile for details.
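A rough sketch of what those static tile sizes look like in practice (the kernel and variable names are illustrative); sub-warp sizes use the native form, while the larger power-of-two sizes need the templated form and, on Compute Capability 7.5 or lower, the additional steps the guide mentions:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tileSizesKernel()
{
    cg::thread_block cta = cg::this_thread_block();

    cg::thread_block_tile<32> tile32 = cg::tiled_partition<32>(cta);  // warp-sized tile
    cg::thread_block_tile<8>  tile8  = cg::tiled_partition<8>(cta);   // sub-warp tile
    cg::thread_block_tile<4>  tile4  = cg::tiled_partition<4>(cta);   // sub-warp tile

    tile32.sync();   // each tile can synchronize and use collectives independently
    tile8.sync();
    tile4.sync();
}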