Does the grid_sync in cooperative groups have the same functionality as the device-wide synchronization?

Hi,
I'm new to the cooperative groups mechanism, and I am wondering whether grid_sync in cooperative groups has the same functionality as a device-wide synchronization?

I mean, suppose I have two dependent kernel launches, kernel 1 and kernel 2.
Do modern GPUs already support device-wide synchronization without having to complete the kernel? In other words, can we fuse the two kernels into a single kernel by using grid_sync()? If so, could you please give a piece of sample code? Thank you very much.

The .sync() method in cooperative groups is a device-wide sync if it is (properly) applied to a grid group.

I wasn't aware of any such thing in CUDA until cooperative groups were introduced, in the CUDA 9 timeframe I believe. There isn't any other explicit device-wide sync in CUDA that I am aware of (barring the kernel launch boundary) other than the one provided by cooperative groups.

There are numerous questions and examples of proper usage of cooperative groups grid sync on various public forums as well as in CUDA sample codes.

Here is one such example on a public forum. You could imagine everything up to the grid.sync() line started out as kernel 1, and everything after that was kernel 2.

Here is a CUDA sample code that handles multiple reduction phases with a grid sync.
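For reference, here is a minimal sketch of the fusion pattern just described. It is not taken from either link, and the kernel, array, and parameter names (fusedKernel, in, tmp, out, N) are made up for illustration. Note that grid.sync() requires a cooperative launch with a grid small enough to be fully co-resident on the device.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void fusedKernel(const float *in, float *tmp, float *out, int N)
{
    cg::grid_group grid = cg::this_grid();
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // "kernel 1": first phase writes tmp
    if (idx < N) tmp[idx] = in[idx] * 2.0f;

    grid.sync();  // device-wide barrier: every thread in the grid waits here

    // "kernel 2": second phase can now safely read any element written above
    if (idx < N) out[idx] = tmp[idx] + tmp[(idx + 1) % N];
}

// Host side: a cooperative launch is required for grid.sync() to be legal.
void launchFused(const float *d_in, float *d_tmp, float *d_out, int N)
{
    int block = 256;
    int grid  = (N + block - 1) / block;  // must not exceed what can be co-resident on the device
    void *args[] = { &d_in, &d_tmp, &d_out, &N };
    cudaLaunchCooperativeKernel((void *)fusedKernel, grid, block, args, 0, 0);
}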

I see. How can we determine the size of each cooperative group? Let's say I want to work on an array of arbitrary size N and want each thread to work on one element of it.

How do I decide each cooperative group's size? Suppose I need to partition the array into many segments and want each cooperative group to work on one segment and perform a segmented scan. The segment sizes differ and are only known at runtime, when the kernel starts. Does CUDA support groups of different sizes determined at runtime?

Yes, perhaps you should read the CG documentation. Even the CUDA sample code I already linked demonstrates usage of groups of different sizes.

However, these groups are all "carved out" of the grid that you launch; in-kernel usage of cooperative groups does not change the size of the grid you launched. All groups are determined at runtime, and they are created mainly from either a static or dynamic subdivision of the threads in a block. At the grid level there is basically only one size, the grid, although with Hopper and threadblock clusters even that is starting to change.
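To illustrate what I mean by static or dynamic subdivision, here is a rough sketch (the kernel and variable names are just illustrative). Both forms carve sub-groups out of the block you launched, and neither changes the grid size:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void subdivideKernel()
{
    cg::thread_block cta = cg::this_thread_block();

    // Static subdivision: the tile size is a compile-time template parameter.
    cg::thread_block_tile<16> tile16 = cg::tiled_partition<16>(cta);

    // Dynamic subdivision: the tile size is a runtime argument (a power of 2
    // no larger than 32 for this non-templated form, as I understand it).
    cg::thread_group tile8 = cg::tiled_partition(cta, 8);

    tile16.sync();
    tile8.sync();
}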

So for "kernel fusion" of two dissimilarly sized grids, one thing you might want to consider is designing each kernel phase using a methodology like a grid-stride loop, to allow flexible use of the grid (the threads) independent of the problem size, as in the sketch below.
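A rough sketch of the grid-stride idea (the kernel and parameter names are made up): the same launched grid handles any problem size N, with each thread processing zero, one, or more elements, so the fused phases do not each need their own grid size.

__global__ void scaleKernel(float *data, int N, float factor)
{
    // Grid-stride loop: correct for any N, independent of the launch configuration.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < N;
         i += gridDim.x * blockDim.x)
    {
        data[i] *= factor;
    }
}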

Thank you!
However, I have actually read the CG documentation as well as the CUDA sample code, for example the one you shared with me.

I am still confused about a few points.

  1. For example, which lines set each group's size, and how can we set it manually ourselves?

  2. After we set each group's size, I think CUDA provides the cg::inclusive_scan() API to perform a scan at the tile level, but not at the level of an arbitrary cooperative group?

  3. It seems to me that both grid.sync() and cuda::barrier can serve as a device-wide sync?

And since cooperative groups were introduced in CUDA 9, does that mean kernels no longer need to be split into multiple kernels when we need global synchronization? I'm not sure whether my understanding is right. Or are there tradeoffs between using grid_sync and breaking the work into multiple kernel launches? Does the former always perform better than the latter?

This line creates a group that is the size of the grid. This line creates a group the size of a threadblock. This line takes a block-sized group that is handed to it and creates a sub-group that is the size of a warp. (See the sketch below.)
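Put together, those three constructions typically look roughly like this (the kernel and variable names are just illustrative):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void groupsKernel()
{
    cg::grid_group grid = cg::this_grid();           // group the size of the whole grid
    cg::thread_block cta = cg::this_thread_block();  // group the size of one threadblock
    cg::thread_block_tile<32> warp =
        cg::tiled_partition<32>(cta);                // warp-sized sub-group carved from the block

    warp.sync();   // each group can be synchronized independently
    cta.sync();
    grid.sync();   // requires a cooperative launch
}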

Cooperative groups don't have infinite flexibility; they essentially work within the hierarchy already established by CUDA. If you have a grid group that is 100,000 threads, for example, I don't know how to create two sub-groups of 50,000 threads each. Up to approximately warp level (i.e. 32 threads) you have considerable flexibility in determining group sizes, and session 9 in this online training series provides additional coverage of these ideas.

Correct, not everything is symmetrical in CG. I don’t think you can do a cg::inclusive_scan at the grid level, for example, although someone may point out that I am wrong.
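As a sketch of what does work, assuming CUDA 11.0 or newer where the <cooperative_groups/scan.h> header is available, an inclusive scan at the warp-tile level looks something like this (the kernel and array names are made up):

#include <cooperative_groups.h>
#include <cooperative_groups/scan.h>
namespace cg = cooperative_groups;

__global__ void tileScanKernel(const int *in, int *out, int N)
{
    cg::thread_block cta = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(cta);

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int val = (idx < N) ? in[idx] : 0;            // every tile thread takes part in the collective
    int scanned = cg::inclusive_scan(tile, val);  // prefix sum across the 32 threads of this tile only
    if (idx < N) out[idx] = scanned;
}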

Yes, there are additional mechanisms for synchronization in the cuda:: namespace as well as in libcu++. These are more recent additions to the CUDA arsenal, and I don’t always have everything in mind when answering questions here. So yes, correct, CG is not the only way to synchronize.

Thank you very much.

Therefore, we cannot create groups of arbitrary size by hand inside the kernel, I think; am I understanding correctly?

Also, we cannot perform cg::scan at every group level as a segmented prefix sum, am I correct?

Lastly, if I only want to fuse multiple small kernels (each one doing one step) into one kernel, then CG can achieve that with cg::grid_group grid = cg::this_grid();, right? By the way, do cg::sync(grid); and grid.sync() perform the same function?

There are some sizes you can create, and some sizes you cannot.

Not all cg methods are supported for all group sizes/types.

I’m not sure what you are asking. You could “fuse” kernels by declaring a grid group, then doing a grid group sync in between the code for successive kernels.

Yes, AFAIK.
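For what it's worth, here is a minimal illustration (the kernel name is hypothetical); as far as I know the two spellings express the same grid-wide barrier:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void syncFormsKernel()
{
    cg::grid_group grid = cg::this_grid();
    grid.sync();      // member-function form
    cg::sync(grid);   // free-function form of the same barrier
}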

Thank you very much!
The size should be the grid size, the threadblock size, the warp size, or a multiple of 32, those kinds of numbers I guess, right?

You can create a group of 4 threads. You can create a group of 8 threads. I haven't personally tabulated all the possible sizes.

You mean using cg::thread_block_tile<x> tile32 = cg::tiled_partition<x>(cta) to set a tile size that is less than 32?

It is all explained in the programming guide here: CUDA C++ Programming Guide.

The tiled_partition method is a collective operation that partitions the parent group into a one-dimensional, row-major tiling of subgroups. A total of (size(parent)/tilesz) subgroups will be created, therefore the parent group size must be evenly divisible by the Size. The allowed parent groups are thread_block or thread_block_tile.

Functionality is limited to native hardware sizes (1/2/4/8/16/32), and cg::size(parent) must be greater than the Size parameter. The templated version of tiled_partition supports sizes of 64/128/256/512 as well, but some additional steps are required on Compute Capability 7.5 or lower; refer to Thread Block Tile for details.
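A rough sketch of what those static tile sizes look like in practice (the kernel and variable names are illustrative); sub-warp sizes use the native form, while the larger power-of-two sizes need the templated form and, on Compute Capability 7.5 or lower, the additional steps the guide mentions:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tileSizesKernel()
{
    cg::thread_block cta = cg::this_thread_block();

    cg::thread_block_tile<32> tile32 = cg::tiled_partition<32>(cta);  // warp-sized tile
    cg::thread_block_tile<8>  tile8  = cg::tiled_partition<8>(cta);   // sub-warp tile
    cg::thread_block_tile<4>  tile4  = cg::tiled_partition<4>(cta);   // sub-warp tile

    tile32.sync();   // each tile can synchronize and use collectives independently
    tile8.sync();
    tile4.sync();
}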