What is behind Cooperative Groups? How about its performance?

kay21s · November 12, 2017, 8:34am

Hi,

I have noticed that CUDA 9 now has a new feature, Cooperative Groups, which allows programmers to define thread groups of any size. This is a desirable feature because it significantly simplifies programming. As this is supported on all GPU architectures, I am wondering how NVIDIA implements it. For instance, the GPU does not support global synchronization, but Cooperative Groups now offers it. How does NVIDIA do this in software? Is there a high-overhead software layer behind it? How does it perform?

Thanks.

Robert_Crovella · November 12, 2017, 11:28am

Global synchronization is not supported on all GPU architectures.

Launching a cooperative grid (necessary in order to call grid.sync()) requires hardware support, which can be queried via a device property. In a nutshell, it is only supported on certain Pascal and Volta family members at this time.
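As a minimal sketch of the device property query mentioned above: the CUDA runtime exposes the attribute `cudaDevAttrCooperativeLaunch`, which reports whether a device can launch a cooperative grid. The helper name `queryCooperativeLaunch` is made up for illustration, and the program simply prints what the runtime reports for each device rather than assuming an answer for any particular GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (name is an assumption): writes 1 to *supported if the
// device can launch cooperative grids, 0 otherwise.
cudaError_t queryCooperativeLaunch(int device, int *supported) {
    return cudaDeviceGetAttribute(supported, cudaDevAttrCooperativeLaunch, device);
}

int main() {
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        std::printf("No CUDA device found\n");
        return 0;
    }
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;

        int supported = 0;
        cudaError_t err = queryCooperativeLaunch(dev, &supported);
        if (err != cudaSuccess) {
            std::printf("Device %d: query failed: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        std::printf("Device %d (%s, sm_%d%d): cooperative launch %s\n",
                    dev, prop.name, prop.major, prop.minor,
                    supported ? "supported" : "not supported");
    }
    return 0;
}
```

Running this on the target machine is probably the most reliable way to answer "does my GPU support it", since the result can also depend on the driver and platform.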

Also, a grid-wide sync (assuming a cooperative grid launch) requires that the grid be sized to fit within the GPU's instantaneous thread-carrying capacity, i.e. all of its thread blocks must be resident on the GPU at the same time. All of the above is covered in the programming guide.
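The sizing requirement can be sketched as follows, assuming a hypothetical two-phase kernel `twoPhase` and helper `runTwoPhase` (both names are illustrative, not from the thread). The grid is capped at the number of blocks that can be co-resident, computed with `cudaOccupancyMaxActiveBlocksPerMultiprocessor` times the SM count, and the kernel is launched with `cudaLaunchCooperativeKernel` so that `grid.sync()` is valid. The block size of 256 and the array length are arbitrary choices for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: phase 1 fills data, the whole grid synchronizes,
// then phase 2 reads values written by (possibly) other blocks.
__global__ void twoPhase(int *data, int *out, size_t n) {
    cg::grid_group grid = cg::this_grid();
    for (size_t i = grid.thread_rank(); i < n; i += grid.size())
        data[i] = static_cast<int>(i);
    grid.sync();  // only valid when launched as a cooperative grid
    for (size_t i = grid.thread_rank(); i < n; i += grid.size())
        out[i] = data[n - 1 - i];
}

// Returns 1 on a correct result, 0 on a mismatch or error, -1 if unsupported.
int runTwoPhase() {
    int dev = 0, supported = 0, numSms = 0, blocksPerSm = 0;
    if (cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, dev) != cudaSuccess
        || !supported)
        return -1;

    const int threads = 256;
    cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, twoPhase, threads, 0);
    int blocks = blocksPerSm * numSms;  // largest grid that is fully resident
    if (blocks == 0) return 0;

    size_t n = 1 << 20;
    int *data = nullptr, *out = nullptr;
    cudaMalloc(&data, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));

    void *args[] = {&data, &out, &n};
    cudaError_t err = cudaLaunchCooperativeKernel((void *)twoPhase, dim3(blocks),
                                                  dim3(threads), args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    int first = -1;
    if (err == cudaSuccess)
        err = cudaMemcpy(&first, out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(data);
    cudaFree(out);
    return (err == cudaSuccess && first == static_cast<int>(n - 1)) ? 1 : 0;
}

int main() {
    int r = runTwoPhase();
    std::printf("%s\n", r == 1 ? "ok" : (r == -1 ? "cooperative launch not supported"
                                                 : "failed"));
    return 0;
}
```

The grid-stride loops are what let a grid smaller than the problem size still cover all `n` elements. Depending on the toolkit version, grid synchronization may need relocatable device code (`-rdc=true`) and an architecture of `sm_60` or newer; treat those build flags as something to check against the programming guide for your CUDA release.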

Beyond that, once you have a set of threads resident, there is a mechanism (obviously) to force all threads to make forward progress up to the grid sync point. AFAIK that mechanism is not published in detail.

Other types of intra-threadblock sync are accomplished via variants of the existing bar sync (PTX or SASS) instruction, and this is discussed in a variety of places including the PTX manual. You can probably also glean some understanding by disassembling and studying compiled code.
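For the intra-threadblock case, a minimal sketch using the Cooperative Groups API might look like the following. The kernel `blockSum` and helper `runBlockSum` are hypothetical names; the example partitions a block into 32-thread tiles, reduces within each tile using `shfl_down`, and uses `block.sync()` where `__syncthreads()` would traditionally appear. It assumes the block size is a multiple of 32 and at most 1024.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: sums blockDim.x ints per block (blockDim.x % 32 == 0).
__global__ void blockSum(const int *in, int *out) {
    __shared__ int tileSums[32];
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int v = in[blockIdx.x * blockDim.x + block.thread_rank()];

    // Reduce within the 32-thread tile; lane 0 ends up with the tile's sum.
    for (int offset = 16; offset > 0; offset /= 2)
        v += tile.shfl_down(v, offset);
    if (tile.thread_rank() == 0)
        tileSums[block.thread_rank() / 32] = v;

    block.sync();  // block-wide barrier, same role as __syncthreads()

    if (block.thread_rank() == 0) {
        int total = 0;
        for (unsigned t = 0; t < blockDim.x / 32; ++t) total += tileSums[t];
        out[blockIdx.x] = total;
    }
}

// Sums 256 ones in a single block; returns the sum, or -1 on any CUDA error.
int runBlockSum() {
    const int threads = 256;
    int host[threads];
    for (int i = 0; i < threads; ++i) host[i] = 1;

    int *in = nullptr, *out = nullptr, result = -1;
    if (cudaMalloc(&in, sizeof(host)) != cudaSuccess) return -1;
    if (cudaMalloc(&out, sizeof(int)) != cudaSuccess) { cudaFree(in); return -1; }
    cudaMemcpy(in, host, sizeof(host), cudaMemcpyHostToDevice);

    blockSum<<<1, threads>>>(in, out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(&result, out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? result : -1;
}

int main() {
    std::printf("block sum = %d\n", runBlockSum());
    return 0;
}
```

Disassembling a kernel like this (for example with `cuobjdump -sass`) is one way to see which barrier and shuffle instructions the group operations compile to on a given architecture.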

No synchronization operations are “free”. Introducing a sync point in your code will have some performance impact, due to the scheduling inefficiency associated with warps that are waiting at the sync point and are effectively idle. Beyond that, I don’t know of any published performance data. It would depend strongly on the specific use case, so I doubt that useful generalizations could be made.

kay21s · November 12, 2017, 1:36pm

Thank you for your reply. Very helpful.

Is there a list of GPUs that support Cooperative Groups? I am getting a TITAN X (Pascal); does it support the feature?
