Clarification on cooperative_groups::tiled_partition<64>::sync() behavior in a 128-thread block

samrivaKS · May 10, 2026, 10:00am

Hi,

I am seeing unexpected behavior with Cooperative Groups and I would like to ask for clarification.

I have a kernel launched with 128 threads per block. Inside the kernel I create a tiled partition of size 64:

**namespace cg = cooperative_groups;

__global__ void my_kernel(...)
{
auto block = cg::this_thread_block();
auto tile64 = cg::tiled_partition<64>(block);
**
// work**

tile64.sync();
**
// more work**
}**

My expectation is that tile64.sync() should synchronize only the 64 threads belonging to each tile.

However, in my test it looks like sync() on the first 64-thread tile is also waiting for the other 64 threads in the block. In other words, it behaves as if the synchronization is block-wide, or at least coupled across the two 64-thread tiles.

What I observe is:

Kernel uses 128 threads per block.
tiled_partition<64>(block) creates two 64-thread tiles.
tile64.sync() appears to stall until both halves of the block reach the synchronization point.
If I change the implementation to use 32-thread groups instead, the behavior is fine and I do not see this issue.

So I would like to understand:

Is thread_block_tile<64>::sync() guaranteed to wait only for the 64 threads in that tile?
Or can the implementation internally use a block-wide barrier / block-wide participation when tile size is 64?
Is there any architecture-dependent behavior here?
Are there restrictions or caveats for tiled_partition<64> that do not apply to tiled_partition<32>?

I am especially interested in whether this is:

expected behavior,
a limitation of the implementation,
or a bug / unsupported usage pattern.

I’m using CUDA13 and NVIDIA RTX A6000

I’ve asked 4 LLMs and they consistently say that from hardware point of view sync is possible only up to 32threads and suggested to not sync on tiles larger than 32…

Thanks.

striker159 · May 10, 2026, 11:22am

Can you show the kernel which makes you believe that

tile64.sync() appears to stall until both halves of the block reach the synchronization point.

?

Thread block tiles should only synchronize the respective threads in the group. Since this is not supported in hardware for tiles >32 threads, it is emulated. The implementation is publicly available in the cuda include directory

samrivaKS · May 10, 2026, 12:35pm

well the kernel is quite complex and cannot be shared but after a lot of debugging I’ve found that the first 64 threads were blocked in sync() this way (blocked for hundreds of ms, because each 64 threads are async):

        auto tile64 = cg::tiled_partition<64>(block);

        tile64.sync();

Changing to tile32 everything works as expected.

AI is suggesting that tile64 is performing a
BAR.SYNC — full block barrier (__syncthreads())

While tile32 (or lower) are performing subgroup/warp barriers.

I didn’t trusted them, but the results I got seems to show that tile64.sync() is not working as expected

striker159 · May 10, 2026, 6:22pm

You should be able to check the ptx or sass code to verify there is no bar.sync. For example, this minimal code does not make use of bar.sync within tile64.sync(). (I added printfs to make it more clear which instructions belong to tile64.sync()).

#include <cooperative_groups.h>
#include <cstdio>

namespace cg = cooperative_groups;

__global__ void my_kernel()
{
    auto block = cg::this_thread_block();
    auto tile64 = cg::tiled_partition<64>(block);

    printf("A\n");
    tile64.sync();
    printf("B\n");
}

mov.u64 	%rd4, $str;
	cvta.global.u64 	%rd5, %rd4;
	{
		st.param.b64 	[param0+0], %rd5;
		st.param.b64 	[param1+0], 0;
		call.uni (retval0),
		vprintf,
		(
		param0,
		param1
		);
		ld.param.b32 	%r16, [retval0+0];
	}
	and.b32  	%r17, %r14, 134217726;
	mov.u32 	%r18, 3;
	shl.b32 	%r4, %r18, %r17;
	bar.warp.sync 	-1;
	@%p2 bra 	$L__BB0_6;
	atom.or.acq_rel.cta.b32 %r19,[%rd2],%r3;
	or.b32  	%r21, %r19, %r3;
	and.b32  	%r22, %r21, %r4;
	setp.eq.s32 	%p3, %r22, %r4;
	@%p3 bra 	$L__BB0_5;
	bra.uni 	$L__BB0_4;
$L__BB0_5:
	not.b32 	%r25, %r4;
	red.and.relaxed.cta.b32 [%rd2],%r25;
	bra.uni 	$L__BB0_6;
$L__BB0_4:
	ld.acquire.cta.u32 %r23,[%rd2];
	and.b32  	%r24, %r23, %r3;
	setp.eq.s32 	%p4, %r24, 0;
	@%p4 bra 	$L__BB0_6;
	bra.uni 	$L__BB0_4;
$L__BB0_6:
	bar.warp.sync 	-1;
	mov.u64 	%rd9, $str$1;
	cvta.global.u64 	%rd10, %rd9;
	{
		st.param.b64 	[param0+0], %rd10;
		st.param.b64 	[param1+0], 0;
		call.uni (retval0),
		vprintf,
		(
		param0,
		param1
		);
		ld.param.b32 	%r26, [retval0+0];
	}

samrivaKS · May 11, 2026, 6:08am

Thanks for the example, I was able to detect my problem which is NOT on sync() but in the previous line, the one creating the partition.

if (threadPerCarrier == 64) {
auto tile64 = cg::tiled_partition<64>(block);
tile64.sync();
}
else if (threadPerCarrier == 32)
{
auto tile32 = cg::tiled_partition<32>(block);
tile32.sync();
}

The if caused warp 0 and warp 1 to participate on tile_partition<64> but warp 2 and 3 do not…
and parent block collective semantics are violated

From NVIDIA docs: "“Partitioning a group is a collective operation and all threads in the group must participate.” and ““The implementation may cause the calling thread to wait until all the members of the parent group have invoked the operation before resuming execution.””

For some reason when using tile32 it “accidentally work” because (probably!) only one warp is involved and the hardware naturally tracks active lanes (?)

CONCLUSION: I’m going to move out the tiled_partition from the if

SUGGESTION: nvcc should probably raise a warning when creating tiled_partitions inside a if, because “sometimes it works”

Curefab · May 11, 2026, 7:18am

bar.sync is the most efficient way to synchronize 64 threads only (from two warps).

bar.sync uses a numbered slot 0..15 and a number of warps*32 to wait for as many threads.

It not necessarily waits for the whole block.

If I remember correctly, cooperative groups don’t use bar.sync, but a slower mechanism.

Topic		Replies	Views
How can I sync within partial threads in a block? CUDA Programming and Performance	1	329	March 9, 2024
Does the grid_sync in cooperative groups have the same functionality as the device-wide synchronization? CUDA Programming and Performance	11	2467	March 20, 2024
Cooperative Groups: Flexible CUDA Thread Programming Technical Blog	32	13059	February 7, 2023
Re: Cooperative groups are much slower than CUB CUDA Programming and Performance	1	141	December 16, 2024
Synchronizing specific threads in cooperative groups CUDA Programming and Performance	1	658	March 10, 2019
Why barrier synchronisation not needed here from best practices guide page number 29 CUDA Programming and Performance	6	7966	June 25, 2010
Cooperative Group Grid synchronization leading to execution freezes CUDA Programming and Performance cuda	2	416	June 24, 2024
Cooperative_groups::this_grid() is not valid on my Volta architecture GPU. How to globally synchronize CUDA Programming and Performance cuda	2	268	June 4, 2024
Synchronizing cooperative groups inside a while loop CUDA NVCC Compiler cuda	0	512	May 18, 2023
Thread groups out of the active thread blocks CUDA Programming and Performance	1	359	November 19, 2020

Clarification on cooperative_groups::tiled_partition<64>::sync() behavior in a 128-thread block

Related topics