Why is __syncthreads() required before cluster_sync() on SM90?

202476410arsmart · August 16, 2025, 1:04pm

Hi all,

I’m working with the Hopper architecture and have a question about the synchronization scopes of __syncthreads() versus cute::cluster_sync().

I frequently see the following pattern, where an intra-block sync is called right before a cluster-wide sync:

// A single thread/warp prepares a resource in shared memory
if (threadIdx.x == 0) {  // designated thread (thread 0 here, purely as an example)
    prepare_shared_resource();
}

// Why is this intra-block sync needed?
__syncthreads();

// Before the inter-CTA sync
cute::cluster_sync();

I had assumed cluster_sync (which uses barrier.cluster) would be a superset of __syncthreads, making the explicit __syncthreads() call redundant.

Could someone clarify the precise scope of barrier.cluster? Does it only synchronize the calling threads across the cluster’s CTAs? If so, is the __syncthreads() mandatory to prevent race conditions within each block before the cluster-level synchronization begins?
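
To make the question concrete, here is a minimal self-contained sketch of the pattern as I understand it. It is not taken from any real code base. To avoid a CUTLASS dependency it swaps cute::cluster_sync() for cluster.sync() from cooperative groups, which as far as I know lowers to the same barrier.cluster arrive/wait pair, but treat that as an assumption. The names exchange_kernel, slot, and run_exchange are made up for illustration, and the sketch assumes CUDA 12 or newer, compilation with -arch=sm_90, and a single cluster of two CTAs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: each CTA publishes one value in its own shared memory,
// then reads the value published by the other CTA in the cluster.
__global__ void __cluster_dims__(2, 1, 1) exchange_kernel(int* out) {
    __shared__ int slot;
    cg::cluster_group cluster = cg::this_cluster();
    const unsigned int rank = cluster.block_rank();

    // A single designated thread prepares the shared resource.
    if (threadIdx.x == 0) {
        slot = 100 + static_cast<int>(rank);
    }

    // Intra-block sync: the call this question is about.
    __syncthreads();

    // Cluster-wide sync (stand-in for cute::cluster_sync()).
    cluster.sync();

    // Read the peer CTA's slot through distributed shared memory.
    const unsigned int peer = (rank + 1) % cluster.num_blocks();
    const int* peer_slot = cluster.map_shared_rank(&slot, peer);
    if (threadIdx.x == 0) {
        out[rank] = *peer_slot;
    }

    // Keep every CTA's shared memory alive until all peers have finished reading.
    cluster.sync();
}

// Hypothetical host helper: launches one cluster of two CTAs and copies back
// what each rank read from its peer. Returns false on any CUDA error.
bool run_exchange(int result[2]) {
    int* d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(int)) != cudaSuccess) {
        return false;
    }
    exchange_kernel<<<2, 128>>>(d_out);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernel to finish before copying.
        err = cudaMemcpy(result, d_out, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_out);
    return err == cudaSuccess;
}

int main() {
    int result[2] = {0, 0};
    if (!run_exchange(result)) {
        std::fprintf(stderr, "CUDA error (is this an sm_90 device?)\n");
        return 1;
    }
    std::printf("rank 0 read %d, rank 1 read %d\n", result[0], result[1]);
    return 0;
}
```

With the __syncthreads() in place I would expect rank 0 to read 101 and rank 1 to read 100. A correct result from a run like this would not by itself show whether the __syncthreads() is actually required, which is exactly what I am unsure about.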

Thanks for any insights.
