Synchronizing only a subset of CUDA warps in a block

Is it possible to synchronize only a subset of the warps in a CUDA block? The effect should be between that of __syncwarp() and __syncthreads().

Can this be done with the PTX instruction bar.sync? And if yes, is it also possible to synchronize a non-contiguous range of warps?

It seems that with the cooperative groups API, it is only possible to sync a smaller subsegment of each warp, or one or multiple blocks, but not a range of warps inside the block.

If it is not possible with an intrinsic, can it be manually done using atomics (for example), in an efficient way?

The scenario is that all threads in a block of 6 warps (for example) execute a program like the following:

copy data from global memory to shared buffer
__syncwarps(0…3)
work in shared buffer
__syncwarps(0…6)
work in shared buffer

After the first __syncwarps(0…3), the data that was copied by warps 3…6 (for example) is not yet needed.

So the program syncs only warps 0…3, and then does computations in these warps in the shared buffer. At the same time warps 3…6 continue the global to shared copy.

Then it synchronizes warps 0…6, and then warps 0…6 rework on all the data in the shared buffer.

Is there also a way to let global-to-shared copies continue asynchronously (i.e. instruction-level parallelism) until the copied data is accessed, even when there are __syncthreads() calls in between? It would need some way to inform the compiler when there is a data dependency.

Yes, bar.sync can do that via its optional second argument.

Pass the number of participating warps times the warp size (i.e. 32) as the second argument, since that argument is defined as the number of participating threads, not warps.
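To illustrate, here is a minimal sketch of a partial barrier via inline PTX. It assumes a single block of 6 warps in which warps 0 to 2 synchronize early on barrier 1 (barrier 0 is the one __syncthreads() uses). The names named_barrier_sync, subset_sync_kernel and run_subset_sync_demo are made up for this example, and the copy is simplified to one element per thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kWarpSize      = 32;
constexpr int kWarps         = 6;                        // block of 6 warps, as in the question
constexpr int kThreads       = kWarps * kWarpSize;       // 192 threads
constexpr int kSubsetWarps   = 3;                        // warps 0, 1, 2 synchronize early
constexpr int kSubsetThreads = kSubsetWarps * kWarpSize; // 96 threads

// Hypothetical wrapper: bar.sync on barrier `id`, waiting for `num_threads` threads.
// `num_threads` is a thread count and should be a multiple of the warp size.
__device__ __forceinline__ void named_barrier_sync(int id, int num_threads) {
    asm volatile("bar.sync %0, %1;" : : "r"(id), "r"(num_threads) : "memory");
}

__global__ void subset_sync_kernel(const int* in, int* out) {
    __shared__ int buf[kThreads];
    const int tid  = threadIdx.x;
    const int warp = tid / kWarpSize;

    buf[tid] = in[tid];  // global -> shared copy, one element per thread

    int partial = 0;
    if (warp < kSubsetWarps) {  // warp-uniform branch: whole warps take it together
        // Only warps 0..2 wait here; warps 3..5 are not involved in barrier 1.
        named_barrier_sync(1, kSubsetThreads);
        // Safe to read elements written by the other warps of the subset.
        partial = 2 * buf[(tid + kWarpSize) % kSubsetThreads];
    }

    __syncthreads();  // full-block barrier: now the whole buffer is visible
    out[tid] = partial + buf[(tid + kSubsetThreads) % kThreads];
}

// Host-side check; returns the number of mismatches, or -1 on a CUDA error.
int run_subset_sync_demo() {
    int h_in[kThreads], h_out[kThreads];
    for (int i = 0; i < kThreads; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    subset_sync_kernel<<<1, kThreads>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int tid = 0; tid < kThreads; ++tid) {
        int partial  = (tid < kSubsetThreads) ? 2 * ((tid + kWarpSize) % kSubsetThreads) : 0;
        int expected = partial + (tid + kSubsetThreads) % kThreads;
        if (h_out[tid] != expected) ++mismatches;
    }
    return mismatches;
}
```

Because bar.sync is called inside a branch that whole warps take together, every thread of a participating warp reaches the barrier. The same approach should work for a non-contiguous set of warps, as long as exactly the stated number of threads arrives on that barrier.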

If you want to implement a producer-consumer scheme, you can also mix bar.sync and bar.arrive instructions on the same barrier resource.
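A rough sketch of such a producer-consumer scheme follows. It assumes 3 producer warps that fill a shared buffer and signal with bar.arrive, and 3 consumer warps that wait with bar.sync on the same barrier. The wrapper and kernel names are hypothetical. The __threadfence_block() before the arrive is a conservative choice in this sketch, not something the PTX documentation is known to require.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kPcWarpSize        = 32;
constexpr int kPcConsumerThreads = 3 * kPcWarpSize;  // warps 0..2 consume
constexpr int kPcProducerThreads = 3 * kPcWarpSize;  // warps 3..5 produce
constexpr int kPcThreads         = kPcConsumerThreads + kPcProducerThreads;  // 192

// Hypothetical wrappers around the PTX barrier instructions.
// For both, `num_threads` is the TOTAL number of threads (arriving + waiting).
__device__ __forceinline__ void pc_barrier_sync(int id, int num_threads) {
    asm volatile("bar.sync %0, %1;" : : "r"(id), "r"(num_threads) : "memory");
}
__device__ __forceinline__ void pc_barrier_arrive(int id, int num_threads) {
    asm volatile("bar.arrive %0, %1;" : : "r"(id), "r"(num_threads) : "memory");
}

__global__ void producer_consumer_kernel(const int* in, int* out) {
    __shared__ int buf[kPcConsumerThreads];
    const int tid = threadIdx.x;

    if (tid >= kPcConsumerThreads) {
        // Producer warps: fill the shared buffer, then signal without waiting.
        const int p = tid - kPcConsumerThreads;
        buf[p] = in[p] + 100;
        __threadfence_block();                 // conservative: publish the write first
        pc_barrier_arrive(1, kPcThreads);
        // Producers could continue with unrelated work here.
    } else {
        // Consumer warps: block until all 96 producer threads have arrived.
        pc_barrier_sync(1, kPcThreads);
        out[tid] = buf[(tid + kPcWarpSize) % kPcConsumerThreads];
    }
}

// Host-side check; returns the number of mismatches, or -1 on a CUDA error.
int run_producer_consumer_demo() {
    int h_in[kPcConsumerThreads], h_out[kPcConsumerThreads];
    for (int i = 0; i < kPcConsumerThreads; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    producer_consumer_kernel<<<1, kPcThreads>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int tid = 0; tid < kPcConsumerThreads; ++tid) {
        int expected = (tid + kPcWarpSize) % kPcConsumerThreads + 100;
        if (h_out[tid] != expected) ++mismatches;
    }
    return mismatches;
}
```

The barrier completes once 192 threads have reached it in total: 96 via bar.arrive and 96 via bar.sync. Only the consumers actually stall, which is what lets the producers run ahead.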

Just to complement what @tera explained, search for “cuda cooperative groups”. It may contain information you want.