Is cluster scope necessary when only CTA0 is performing the try_wait?

In the PTX documentation for mbarrier.try_wait, I see an example like the one below:

// Example 4, Synchronizing the CTA0 threads with cluster threads
.reg .b64 %r1, addr, remAddr;
.shared .b64 shMem;
cvta.shared.u64 addr, shMem;
mapa.u64 remAddr, addr, 0; // CTA0’s shMem instance
// One thread from CTA0 executing the below initialization operation
@p0 mbarrier.init.shared::cta.b64 [shMem], N; // N = no of cluster threads
barrier.cluster.arrive;
barrier.cluster.wait;
// Entire cluster executing the below arrive operation
mbarrier.arrive.release.cluster.b64 _, [remAddr];
// computation not requiring mbarrier synchronization ...
// Only CTA0 threads executing the below wait operation
waitLoop:
mbarrier.try_wait.parity.acquire.cluster.shared::cta.b64 complete, [shMem], 0;
@!complete bra waitLoop;

However, in the final part of that example, only threads from CTA0 execute the wait operation:

// Only CTA0 threads executing the below wait operation
waitLoop:
mbarrier.try_wait.parity.acquire.cluster.shared::cta.b64 complete, [shMem], 0;
@!complete bra waitLoop;

I’m wondering why the .cluster scope is specified on the try_wait. Since only CTA0 threads execute it, and the mbarrier object lives in CTA0’s own shared memory, wouldn’t it work without the explicit .cluster qualifier? In other words, would removing .cluster have the same effect here, given that the wait operation is limited to CTA0?
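To make the question concrete, here is a sketch of the variant I have in mind. It is not taken from the documentation. As far as I can tell from the syntax section, .scope accepts .cta and .cluster and falls back to .cta when omitted, so dropping .cluster should be equivalent to writing .cta explicitly, as below. The label waitLoopCta is a name I made up. The registers and shMem are the ones from the example above.

```ptx
// Hypothetical variant of the wait loop: .cta scope instead of .cluster.
// Everything before this point (init, cluster barrier, remote arrives with
// .release.cluster) is assumed unchanged from Example 4.
// Only CTA0 threads executing the below wait operation
waitLoopCta:
mbarrier.try_wait.parity.acquire.cta.shared::cta.b64 complete, [shMem], 0;
@!complete bra waitLoopCta;
```

What I can’t tell is whether an acquire at .cta scope still synchronizes with the release.cluster arrives issued by threads in the other CTAs, or whether the scope on the wait has to cover those threads as well.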