Hi all,
I’m working with the Hopper architecture and have a question about the synchronization scopes of __syncthreads() versus cute::cluster_sync().
I frequently see the following pattern, where an intra-block sync is called right before a cluster-wide sync:
// A single thread/warp prepares a resource in shared memory
if (/* designated thread */) {
    prepare_shared_resource();
}
// Why is this intra-block sync needed?
__syncthreads();
// Before the inter-CTA sync
cute::cluster_sync();
I had assumed that cute::cluster_sync(), which issues barrier.cluster.arrive followed by barrier.cluster.wait, would be a superset of __syncthreads(), making the explicit __syncthreads() call redundant.
Could someone clarify the precise scope of barrier.cluster? Does it only synchronize the calling threads across the cluster’s CTAs? If so, is the __syncthreads() mandatory to prevent race conditions within each block before the cluster-level synchronization begins?
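To make the question concrete, here is a minimal repro sketch of the kind of kernel I have in mind. It is not taken from any real codebase: the kernel name peer_read_kernel, the slot variable, and the 2-CTA cluster size are all made up for illustration. I use cooperative_groups' cluster.sync() here instead of cute::cluster_sync() so the example has no CUTLASS dependency; I am assuming the two behave the same for the purpose of this question. It needs an sm_90 device and something like nvcc -arch=sm_90.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cstdio>

namespace cg = cooperative_groups;

// Hypothetical repro kernel: compile-time cluster of 2 CTAs along x.
__global__ void __cluster_dims__(2, 1, 1) peer_read_kernel(int* out) {
    __shared__ int slot;
    cg::cluster_group cluster = cg::this_cluster();
    const unsigned rank = cluster.block_rank();

    // A single designated thread prepares the shared resource.
    if (threadIdx.x == 0) {
        slot = 100 + static_cast<int>(rank);
    }

    __syncthreads();  // the intra-block sync in question
    cluster.sync();   // the cluster-wide sync

    // A different thread reads the peer CTA's slot through
    // distributed shared memory.
    const unsigned peer = (rank + 1) % cluster.num_blocks();
    int* peer_slot = cluster.map_shared_rank(&slot, peer);
    if (threadIdx.x == blockDim.x - 1) {
        out[blockIdx.x] = *peer_slot;
    }

    // Keep this CTA's shared memory alive until the peer has read it.
    cluster.sync();
}

int main() {
    constexpr int kBlocks = 2;     // must be a multiple of the cluster size
    constexpr int kThreads = 128;

    int* d_out = nullptr;
    cudaMalloc(&d_out, kBlocks * sizeof(int));

    peer_read_kernel<<<kBlocks, kThreads>>>(d_out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    int h_out[kBlocks];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int b = 0; b < kBlocks; ++b) {
        std::printf("CTA %d read %d from its peer\n", b, h_out[b]);
    }

    cudaFree(d_out);
    return 0;
}
```

With both syncs in place, I would expect each CTA to read the value written by its peer. What I would like to understand is whether removing the __syncthreads() line can ever change that result, or whether the cluster barrier alone already orders thread 0's write before the peer's read.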
Thanks for any insights.