Inter-warp synchronization with Jetson Nano

Jetson Nano has compute capability 5.3. This section of PTX ISA 8.4 states:

For .target sm_6x or below,

  1. barrier{.cta} instruction without .aligned modifier is equivalent to .aligned variant and has the same restrictions as of .aligned variant.

Does that mean that if I use

barrier.sync 0,64;

it behaves like __syncthreads(), causing all threads in the block to synchronize? If so, is there any way to synchronize only a subset of the warps in a block on Jetson Nano?
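For reference, the quoted restriction is about whether all threads of a warp must execute the barrier together. It does not say that the thread-count operand is ignored. The following is a minimal sketch, assuming the thread-count operand of bar.sync behaves as the PTX ISA describes on this target. The names sync_two_warps, named_barrier_demo and run_named_barrier_demo are hypothetical, and barrier ID 1 is chosen only to stay clear of barrier 0, which __syncthreads() uses.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wait on named barrier 1 until 64 threads (two full
// warps) have arrived. Every thread of a participating warp must call it.
__device__ __forceinline__ void sync_two_warps()
{
    asm volatile("bar.sync 1, 64;" ::: "memory");
}

__global__ void named_barrier_demo(int *out)
{
    __shared__ int staged[32];
    const unsigned warp = threadIdx.x / warpSize;
    const unsigned lane = threadIdx.x % warpSize;

    if (warp == 0) {
        staged[lane] = 100 + lane;   // producer warp writes shared memory
        sync_two_warps();
    } else if (warp == 1) {
        sync_two_warps();
        out[lane] = staged[lane];    // consumer warp reads after the barrier
    }
    // Warps 2 and 3 never reach the barrier and simply exit.
}

// Host-side check (assumed harness): returns the number of mismatching
// elements, or -1 if a CUDA call failed.
int run_named_barrier_demo()
{
    int h_out[32] = {0};
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return -1;
    cudaMemset(d_out, 0, sizeof(h_out));
    named_barrier_demo<<<1, 128>>>(d_out);   // 4 warps, only 2 synchronize
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;
    int mismatches = 0;
    for (int i = 0; i < 32; ++i)
        mismatches += (h_out[i] != 100 + i);
    return mismatches;
}
```

The kernel is launched with four warps so that two of them skip the barrier entirely. If those two warps had to arrive as well, the kernel would hang instead of completing.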


Are you asking about the Independent Thread Scheduling feature?
If so, it is only supported on devices with compute capability 7.x or higher.


No, I’m not asking about intra-warp sync, but inter-warp sync. I need to synchronize two warps in a block without synchronizing all of them, as __syncthreads() does.
I’m currently experimenting with:

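__shared__ volatile uint32_t lock;

if (threadIdx.x == 0)
    lock = 0;      // initialize once per block, not once per warp
__syncthreads();   // make the initial value visible before any warp arrives

if (lane == 0)     // lane: index of the thread within the warp
    atomicXor((uint32_t *)&lock, 1U);  // first warp sets the lock to 1, second clears it
while (lock != 0)
    ;              // all lanes of the waiting warp spin here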

That seems to work as long as only two warps are allowed to reach that code. It also relies on the fact that on compute capability 5.3 all threads in a warp are implicitly synchronized, so if one thread is held in the while loop, the whole warp is held.
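The approach above can be sketched as a complete test kernel. This is only a sketch under assumptions: the helper two_warp_rendezvous, the kernel pair_demo and the host function run_pair_demo are hypothetical names, the lock is used once per launch, and the __threadfence_block() call is an added precaution so that the producer's shared-memory writes are visible before the lock flips.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: rendezvous for exactly two warps of one block.
// *lock must be 0 on entry. Single use: reusing it without extra care can
// deadlock if one warp re-enters before the other has observed the 0.
__device__ void two_warp_rendezvous(volatile uint32_t *lock)
{
    const unsigned lane = threadIdx.x % warpSize;
    __threadfence_block();                 // publish earlier shared writes
    if (lane == 0)
        atomicXor((uint32_t *)lock, 1U);   // first arrival: 0 -> 1, second: 1 -> 0
    while (*lock != 0)
        ;                                  // whole warp waits for the second arrival
}

__global__ void pair_demo(int *out)
{
    __shared__ volatile uint32_t lock;
    __shared__ volatile int staged[32];

    if (threadIdx.x == 0)
        lock = 0;
    __syncthreads();                       // one-time, block-wide setup

    const unsigned warp = threadIdx.x / warpSize;
    const unsigned lane = threadIdx.x % warpSize;

    if (warp == 0) {
        staged[lane] = 100 + lane;         // producer warp
        two_warp_rendezvous(&lock);
    } else if (warp == 1) {
        two_warp_rendezvous(&lock);
        out[lane] = staged[lane];          // consumer warp
    }
    // Warps 2 and 3 never touch the lock.
}

// Host-side check (assumed harness): returns the number of mismatching
// elements, or -1 if a CUDA call failed.
int run_pair_demo()
{
    int h_out[32] = {0};
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return -1;
    cudaMemset(d_out, 0, sizeof(h_out));
    pair_demo<<<1, 128>>>(d_out);          // 4 warps, only 2 rendezvous
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;
    int mismatches = 0;
    for (int i = 0; i < 32; ++i)
        mismatches += (h_out[i] != 100 + i);
    return mismatches;
}
```

The one block-wide __syncthreads() is only there to initialize the lock. After that, only warps 0 and 1 wait on each other, while the remaining warps run through without stopping.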

But I wonder if a more elegant solution exists, without using shared memory and atomics.


We don’t have such an API.
But you can implement it in software, as you mentioned.