Why is __syncthreads() required before cute::cluster_sync() on SM90?

Hi all,

I’m working with the Hopper architecture and have a question about the synchronization scopes of __syncthreads() versus cute::cluster_sync().

I frequently see the following pattern, where an intra-block sync is called right before a cluster-wide sync:

// A single designated thread (or warp) prepares a resource in shared memory.
// Thread 0 is used here only for illustration.
if (threadIdx.x == 0) {
    prepare_shared_resource();  // placeholder for the actual setup work
}

// Why is this intra-block sync needed?
__syncthreads();

// Before the inter-CTA sync
cute::cluster_sync();

I had assumed that cute::cluster_sync() (which, as far as I can tell, issues barrier.cluster.arrive followed by barrier.cluster.wait) would be a superset of __syncthreads(), making the explicit __syncthreads() call redundant.

Could someone clarify the precise scope of barrier.cluster? Does it synchronize only the threads that execute it in each CTA of the cluster, rather than every thread of every CTA? If so, is the __syncthreads() mandatory to prevent race conditions within each block before the cluster-level synchronization begins?
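
To make the question concrete, here is a minimal self-contained sketch of the same pattern. It uses the cooperative groups API instead of CuTe, on the assumption (mine, not verified) that cg::this_cluster().sync() has the same scope as cute::cluster_sync(). The kernel name, the output buffer, and the values written are all made up for illustration. It needs to be built for Hopper, e.g. with nvcc -arch=sm_90.

```cuda
#include <cooperative_groups.h>
#include <cstdio>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: thread 0 of each CTA publishes a value in shared
// memory, then each CTA reads its peer's value through distributed
// shared memory. The cluster size is fixed at 2 CTAs at compile time.
__global__ void __cluster_dims__(2, 1, 1) cluster_pattern(int* out) {
    __shared__ int resource;
    cg::cluster_group cluster = cg::this_cluster();

    // A single designated thread prepares the shared resource.
    if (threadIdx.x == 0) {
        resource = 100 + static_cast<int>(cluster.block_rank());
    }

    __syncthreads();  // the intra-block sync this question is about
    cluster.sync();   // cluster-wide barrier (stand-in for cute::cluster_sync())

    // Read the peer CTA's copy of `resource` via distributed shared memory.
    unsigned int peer = cluster.block_rank() ^ 1u;
    int* remote = cluster.map_shared_rank(&resource, peer);
    if (threadIdx.x == 0) {
        out[blockIdx.x] = *remote;
    }

    // Keep this CTA's shared memory alive until the peer has finished reading.
    cluster.sync();
}

int main() {
    int* d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(int)) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }

    // The grid size must be a multiple of the compile-time cluster size (2).
    cluster_pattern<<<2, 128>>>(d_out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    int h_out[2] = {0, 0};
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    std::printf("CTA 0 read %d, CTA 1 read %d\n", h_out[0], h_out[1]);

    cudaFree(d_out);
    return 0;
}
```

Removing the __syncthreads() line in this sketch is exactly the change I am unsure about.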

Thanks for any insights.