Clarification on cooperative_groups::tiled_partition<64>::sync() behavior in a 128-thread block

Hi,

I am seeing unexpected behavior with Cooperative Groups and I would like to ask for clarification.

I have a kernel launched with 128 threads per block. Inside the kernel I create a tiled partition of size 64:

**namespace cg = cooperative_groups;

__global__ void my_kernel(...)
{
auto block = cg::this_thread_block();
auto tile64 = cg::tiled_partition<64>(block);
**
// work**

tile64.sync();
**
// more work**
}**

My expectation is that tile64.sync() should synchronize only the 64 threads belonging to each tile.

However, in my test it looks like sync() on the first 64-thread tile is also waiting for the other 64 threads in the block. In other words, it behaves as if the synchronization is block-wide, or at least coupled across the two 64-thread tiles.

What I observe is:

  • Kernel uses 128 threads per block.

  • tiled_partition<64>(block) creates two 64-thread tiles.

  • tile64.sync() appears to stall until both halves of the block reach the synchronization point.

  • If I change the implementation to use 32-thread groups instead, the behavior is fine and I do not see this issue.

So I would like to understand:

  1. Is thread_block_tile<64>::sync() guaranteed to wait only for the 64 threads in that tile?

  2. Or can the implementation internally use a block-wide barrier / block-wide participation when tile size is 64?

  3. Is there any architecture-dependent behavior here?

  4. Are there restrictions or caveats for tiled_partition<64> that do not apply to tiled_partition<32>?

I am especially interested in whether this is:

  • expected behavior,

  • a limitation of the implementation,

  • or a bug / unsupported usage pattern.

I’m using CUDA13 and NVIDIA RTX A6000

I’ve asked 4 LLMs and they consistently say that from hardware point of view sync is possible only up to 32threads and suggested to not sync on tiles larger than 32…

Thanks.

Can you show the kernel which makes you believe that

tile64.sync() appears to stall until both halves of the block reach the synchronization point.

?

Thread block tiles should only synchronize the respective threads in the group. Since this is not supported in hardware for tiles >32 threads, it is emulated. The implementation is publicly available in the cuda include directory

well the kernel is quite complex and cannot be shared but after a lot of debugging I’ve found that the first 64 threads were blocked in sync() this way (blocked for hundreds of ms, because each 64 threads are async):

        auto tile64 = cg::tiled_partition<64>(block);

        tile64.sync();

Changing to tile32 everything works as expected.

AI is suggesting that tile64 is performing a
BAR.SYNC — full block barrier (__syncthreads())

While tile32 (or lower) are performing subgroup/warp barriers.

I didn’t trusted them, but the results I got seems to show that tile64.sync() is not working as expected

You should be able to check the ptx or sass code to verify there is no bar.sync. For example, this minimal code does not make use of bar.sync within tile64.sync(). (I added printfs to make it more clear which instructions belong to tile64.sync()).

#include <cooperative_groups.h>
#include <cstdio>

namespace cg = cooperative_groups;

__global__ void my_kernel()
{
    auto block = cg::this_thread_block();
    auto tile64 = cg::tiled_partition<64>(block);

    printf("A\n");
    tile64.sync();
    printf("B\n");
}
mov.u64 	%rd4, $str;
	cvta.global.u64 	%rd5, %rd4;
	{
		st.param.b64 	[param0+0], %rd5;
		st.param.b64 	[param1+0], 0;
		call.uni (retval0),
		vprintf,
		(
		param0,
		param1
		);
		ld.param.b32 	%r16, [retval0+0];
	}
	and.b32  	%r17, %r14, 134217726;
	mov.u32 	%r18, 3;
	shl.b32 	%r4, %r18, %r17;
	bar.warp.sync 	-1;
	@%p2 bra 	$L__BB0_6;
	atom.or.acq_rel.cta.b32 %r19,[%rd2],%r3;
	or.b32  	%r21, %r19, %r3;
	and.b32  	%r22, %r21, %r4;
	setp.eq.s32 	%p3, %r22, %r4;
	@%p3 bra 	$L__BB0_5;
	bra.uni 	$L__BB0_4;
$L__BB0_5:
	not.b32 	%r25, %r4;
	red.and.relaxed.cta.b32 [%rd2],%r25;
	bra.uni 	$L__BB0_6;
$L__BB0_4:
	ld.acquire.cta.u32 %r23,[%rd2];
	and.b32  	%r24, %r23, %r3;
	setp.eq.s32 	%p4, %r24, 0;
	@%p4 bra 	$L__BB0_6;
	bra.uni 	$L__BB0_4;
$L__BB0_6:
	bar.warp.sync 	-1;
	mov.u64 	%rd9, $str$1;
	cvta.global.u64 	%rd10, %rd9;
	{
		st.param.b64 	[param0+0], %rd10;
		st.param.b64 	[param1+0], 0;
		call.uni (retval0),
		vprintf,
		(
		param0,
		param1
		);
		ld.param.b32 	%r26, [retval0+0];
	}

Thanks for the example, I was able to detect my problem which is NOT on sync() but in the previous line, the one creating the partition.

if (threadPerCarrier == 64) {
auto tile64 = cg::tiled_partition<64>(block);
tile64.sync();
}
else if (threadPerCarrier == 32)
{
auto tile32 = cg::tiled_partition<32>(block);
tile32.sync();
}

The if caused warp 0 and warp 1 to participate on tile_partition<64> but warp 2 and 3 do not…
and parent block collective semantics are violated

From NVIDIA docs: "“Partitioning a group is a collective operation and all threads in the group must participate.” and ““The implementation may cause the calling thread to wait until all the members of the parent group have invoked the operation before resuming execution.””

For some reason when using tile32 it “accidentally work” because (probably!) only one warp is involved and the hardware naturally tracks active lanes (?)

CONCLUSION: I’m going to move out the tiled_partition from the if

SUGGESTION: nvcc should probably raise a warning when creating tiled_partitions inside a if, because “sometimes it works”

bar.sync is the most efficient way to synchronize 64 threads only (from two warps).

bar.sync uses a numbered slot 0..15 and a number of warps*32 to wait for as many threads.

It not necessarily waits for the whole block.

If I remember correctly, cooperative groups don’t use bar.sync, but a slower mechanism.