Performance of Cooperative Groups grid sync vs. atomics-based grid sync

Hi,

We know that atomics and thread fences can be used to implement a grid-wide (global) sync within a kernel; it's a software workaround. Has anyone benchmarked Cooperative Groups' grid sync against an atomics-based global sync? The atomics-based workaround works on architectures > 2.0, but the new grid sync in Cooperative Groups works only on Pascal and Volta. How is CG's grid sync implemented? What hardware feature (in Pascal and Volta) is required for it to work? Does it use atomics internally?
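For concreteness, here is a minimal sketch of the kind of atomics-based workaround meant above. The names `sw_grid_sync`, `two_phase`, and `run_sw_grid_sync_demo` are hypothetical, not CUDA APIs. It assumes a 1D grid and, importantly, that every block of the grid is resident on the GPU at the same time; if that does not hold, the spin loop can deadlock.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): software grid barrier built from
// atomicAdd and __threadfence. Assumes a 1D grid whose blocks are ALL
// co-resident on the device; otherwise the spin loop below can deadlock.
// `round` is 1 for the first barrier in the kernel, 2 for the second, etc.
__device__ void sw_grid_sync(unsigned int *counter, unsigned int round)
{
    __threadfence();   // make this thread's earlier global writes visible device-wide
    __syncthreads();   // every thread of this block has reached the barrier
    if (threadIdx.x == 0) {
        atomicAdd(counter, 1u);                         // announce this block's arrival
        const unsigned int target = round * gridDim.x;  // arrivals expected after this round
        while (atomicAdd(counter, 0u) < target) { }     // spin on an atomic read
        __threadfence();
    }
    __syncthreads();   // release the remaining threads of the block
}

// Phase 1: each block publishes a value. Phase 2: each block sums all of them.
// Phase 2 is only correct if the barrier really synchronizes the whole grid.
__global__ void two_phase(int *partial, int *out, unsigned int *counter)
{
    if (threadIdx.x == 0) partial[blockIdx.x] = (int)blockIdx.x + 1;
    sw_grid_sync(counter, 1u);
    if (threadIdx.x == 0) {
        const volatile int *p = partial;   // volatile: do not serve these loads from a stale cache
        int sum = 0;
        for (unsigned int b = 0; b < gridDim.x; ++b) sum += p[b];
        out[blockIdx.x] = sum;
    }
}

// Returns 0 if every block saw the full sum, a positive mismatch count
// otherwise, or -1 on a CUDA error.
int run_sw_grid_sync_demo()
{
    const int blocks = 4, threads = 32;   // small grid so all blocks fit on the device at once
    const int expected = blocks * (blocks + 1) / 2;
    int *partial = nullptr, *out = nullptr;
    unsigned int *counter = nullptr;
    if (cudaMalloc(&partial, blocks * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&out, blocks * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&counter, sizeof(unsigned int)) != cudaSuccess) return -1;
    cudaMemset(counter, 0, sizeof(unsigned int));   // barrier counter starts at zero

    two_phase<<<blocks, threads>>>(partial, out, counter);

    int host_out[blocks];
    int bad = 0;
    if (cudaMemcpy(host_out, out, sizeof(host_out), cudaMemcpyDeviceToHost) != cudaSuccess) {
        bad = -1;
    } else {
        for (int b = 0; b < blocks; ++b)
            if (host_out[b] != expected) ++bad;
    }
    cudaFree(partial);
    cudaFree(out);
    cudaFree(counter);
    return bad;
}
```

Calling `run_sw_grid_sync_demo()` from `main` and checking for a return value of 0 exercises one barrier. A benchmark would time many barrier rounds per launch and compare against a kernel that calls Cooperative Groups' `this_grid().sync()` and is launched with `cudaLaunchCooperativeKernel`.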

Thanks