Hi,
I am seeing unexpected behavior with Cooperative Groups and I would like to ask for clarification.
I have a kernel launched with 128 threads per block. Inside the kernel I create a tiled partition of size 64:
**namespace cg = cooperative_groups;
__global__ void my_kernel(...)
{
auto block = cg::this_thread_block();
auto tile64 = cg::tiled_partition<64>(block);
**
// work**
tile64.sync();
**
// more work**
}**
My expectation is that tile64.sync() should synchronize only the 64 threads belonging to each tile.
However, in my test it looks like sync() on the first 64-thread tile is also waiting for the other 64 threads in the block. In other words, it behaves as if the synchronization is block-wide, or at least coupled across the two 64-thread tiles.
What I observe is:
-
Kernel uses 128 threads per block.
-
tiled_partition<64>(block)creates two 64-thread tiles. -
tile64.sync()appears to stall until both halves of the block reach the synchronization point. -
If I change the implementation to use 32-thread groups instead, the behavior is fine and I do not see this issue.
So I would like to understand:
-
Is
thread_block_tile<64>::sync()guaranteed to wait only for the 64 threads in that tile? -
Or can the implementation internally use a block-wide barrier / block-wide participation when tile size is 64?
-
Is there any architecture-dependent behavior here?
-
Are there restrictions or caveats for
tiled_partition<64>that do not apply totiled_partition<32>?
I am especially interested in whether this is:
-
expected behavior,
-
a limitation of the implementation,
-
or a bug / unsupported usage pattern.
I’m using CUDA13 and NVIDIA RTX A6000
I’ve asked 4 LLMs and they consistently say that from hardware point of view sync is possible only up to 32threads and suggested to not sync on tiles larger than 32…
Thanks.