What is behind Cooperative Groups? How about its performance?

Hi,

I have noticed that CUDA 9 now has a new feature, Cooperative Groups, which allows programmers to define thread groups of arbitrary size. This is a desirable feature because it significantly simplifies programming. As this is supported on all GPU architectures, I am wondering how NVIDIA implements such a feature. For instance, the GPU does not support global synchronization, but Cooperative Groups now supports it. I am wondering how NVIDIA does this in software. Is there a software layer with very high overhead behind it? How about the performance?

Thanks.

Global synchronization is not supported on all GPU architectures.

Launching a cooperative grid (which is necessary in order to call grid.sync()) requires hardware support, which can be queried via a device property (the cooperativeLaunch field of cudaDeviceProp, or the cudaDevAttrCooperativeLaunch device attribute). In a nutshell, it is only supported on certain Pascal and Volta family members at this time.
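As a rough sketch of that query (not an official sample), the host code below asks the runtime whether a device reports cooperative launch support. The helper names supportsCooperativeLaunch and reportCooperativeLaunchSupport are made up for illustration; cudaDeviceGetAttribute and cudaDevAttrCooperativeLaunch are the runtime API pieces being demonstrated.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if `device` reports cooperative launch
// support, 0 if it does not, and -1 if the query itself fails
// (for example, when no CUDA device is present).
int supportsCooperativeLaunch(int device)
{
    int supported = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &supported, cudaDevAttrCooperativeLaunch, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "attribute query failed: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }
    return supported ? 1 : 0;
}

// Hypothetical helper: prints the result for every visible device.
void reportCooperativeLaunchSupport()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        count = 0;  // treat a failed count query as "no devices"
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) {
            continue;
        }
        std::printf("device %d (%s): cooperative launch = %d\n",
                    d, prop.name, supportsCooperativeLaunch(d));
    }
}
```

Running reportCooperativeLaunchSupport() on the actual machine is the most reliable way to answer "does my card support it", since the result reflects the installed GPU, driver, and operating mode rather than a static list.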

Also, a grid-wide sync (assuming a cooperative grid launch) requires that the grid be sized so that all of its thread blocks can be resident on the GPU at the same time, i.e. the grid must fit within the GPU's instantaneous thread-carrying capacity. All of the above is covered in the programming guide.
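To make the sizing requirement concrete, here is a minimal sketch that caps the grid at the number of blocks that can be co-resident and then performs a cooperative launch. The kernel twoPhase and the wrapper launchTwoPhase are hypothetical names invented for this example, and the block size of 256 is an arbitrary assumption. The runtime calls (cudaOccupancyMaxActiveBlocksPerMultiprocessor, cudaLaunchCooperativeKernel) and grid.sync() are the real API.

```cuda
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical two-phase kernel. Phase 2 reads values that other blocks
// wrote in phase 1, so a grid-wide barrier is needed between the phases.
__global__ void twoPhase(const float* in, float* tmp, float* out, int n)
{
    cg::grid_group grid = cg::this_grid();
    int stride = gridDim.x * blockDim.x;
    int start  = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: grid-stride loop, so a capped grid still covers all n items.
    for (int i = start; i < n; i += stride)
        tmp[i] = 2.0f * in[i];

    grid.sync();  // only valid when the kernel was launched cooperatively

    // Phase 2: each element also needs its neighbor's phase-1 result.
    for (int i = start; i < n; i += stride)
        out[i] = tmp[i] + tmp[(i + 1) % n];
}

// Hypothetical wrapper: sizes the grid to what can be resident at once,
// then launches cooperatively on the current device.
cudaError_t launchTwoPhase(const float* in, float* tmp, float* out, int n)
{
    const int threads = 256;  // assumed block size
    int device = 0, numSMs = 0, blocksPerSM = 0;
    cudaError_t err;

    if ((err = cudaGetDevice(&device)) != cudaSuccess) return err;
    if ((err = cudaDeviceGetAttribute(&numSMs,
             cudaDevAttrMultiProcessorCount, device)) != cudaSuccess) return err;
    if ((err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
             &blocksPerSM, twoPhase, threads, 0)) != cudaSuccess) return err;

    int maxResident = blocksPerSM * numSMs;           // co-resident block limit
    int wanted      = (n + threads - 1) / threads;    // blocks for 1 item/thread
    int blocks      = wanted < maxResident ? wanted : maxResident;
    if (blocks < 1) return cudaErrorInvalidValue;

    void* args[] = { &in, &tmp, &out, &n };
    return cudaLaunchCooperativeKernel((void*)twoPhase, dim3(blocks),
                                       dim3(threads), args, 0, 0);
}
```

With the CUDA 9 toolchain, a kernel that calls grid.sync() has to be built for compute capability 6.0 or higher with relocatable device code enabled (for example, nvcc -arch=sm_60 -rdc=true); as far as I know, later toolkits relaxed the -rdc requirement. The launch returns an error if the device does not support cooperative launch or if the requested grid is too large to be resident.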

Beyond that, once you have a set of threads resident, there is a mechanism (obviously) to force all threads to make forward progress up to the grid sync point. AFAIK that mechanism is not published in detail.

Other, intra-threadblock types of sync are accomplished via variants of the existing bar.sync instruction (PTX, or its SASS equivalent), and this is discussed in a variety of places, including the PTX manual. You can probably also glean some understanding by disassembling and studying compiled code.
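For reference, this is roughly what intra-block groups look like at the source level. The kernel below is a sketch with a made-up name (blockSums) that assumes a one-dimensional block whose size is a multiple of 32. It reduces within 32-thread tiles using the tile's shuffle operation and then uses block.sync(), the Cooperative Groups counterpart of __syncthreads(), before combining the per-tile results.

```cuda
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: writes the sum of each block's slice of `in`
// to out[blockIdx.x]. Assumes blockDim.x is a multiple of 32 (max 1024).
__global__ void blockSums(const int* in, int* out, int n)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int v = (i < n) ? in[i] : 0;

    // Tile-level reduction: threads in the tile exchange registers directly.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        v += tile.shfl_down(v, offset);

    // One slot per tile; 32 slots cover blocks of up to 1024 threads.
    __shared__ int partial[32];
    if (tile.thread_rank() == 0)
        partial[threadIdx.x / 32] = v;

    block.sync();  // block-wide barrier, same role as __syncthreads()

    // Thread 0 combines the per-tile partial sums.
    if (threadIdx.x == 0) {
        int tiles = (blockDim.x + 31) / 32;
        int total = 0;
        for (int t = 0; t < tiles; ++t)
            total += partial[t];
        out[blockIdx.x] = total;
    }
}
```

To see which instructions the sync calls turn into on a given architecture, one option is to compile this and dump the machine code with cuobjdump -sass, then look for the barrier instruction around the block.sync() call.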

No synchronization operation is “free”. Introducing a sync point in your code will have some performance impact, because warps that have already reached the sync point sit effectively idle until the rest arrive, which reduces scheduling efficiency. Beyond that, I don’t know of any published performance data. It would depend strongly on the specific use case, and I doubt that useful generalizations could be made.

Thank you for your reply. Very helpful.

Is there a list of GPUs that support Cooperative Groups? I will soon have a TITAN X (Pascal); does it support the feature?