How do I use PTX mbarrier.try_wait within a cluster?

// Example 4, Synchronizing the CTA0 threads with cluster threads (PTX ISA, Release 8.5)
.reg .b64 %r1, addr, remAddr;
.shared .b64 shMem;
cvta.shared.u64 addr, shMem;
mapa.u64 remAddr, addr, 0; // CTA0's shMem instance
// One thread from CTA0 executing the below initialization operation
@p0 mbarrier.init.shared::cta.b64 [shMem], N; // N = no of cluster threads
barrier.cluster.arrive;
barrier.cluster.wait;
// Entire cluster executing the below arrive operation
mbarrier.arrive.release.cluster.b64 _, [remAddr];
// computation not requiring mbarrier synchronization ...
// Only CTA0 threads executing the below wait operation
waitLoop:
mbarrier.try_wait.parity.acquire.cluster.shared::cta.b64 complete, [shMem], 0;
@!complete bra waitLoop;

If only CTA0 executes try_wait, how can the other CTAs be stalled? After this point I want to do some operations across the CTAs of the cluster, so all threads must be stalled until the barrier completes. This code runs only inside the consumer, so I cannot use barrier.cluster.wait.