Lots of barrier operations have a .relaxed option; see the PTX ISA User Guide:
The optional .sem qualifier specifies a memory synchronizing effect as described in the Memory Consistency Model. If the .sem qualifier is absent, .release is assumed by default.
or
The optional .relaxed qualifier on barrier.cluster.arrive specifies that there are no memory ordering and visibility guarantees provided for the memory accesses performed prior to barrier.cluster.arrive.
Visibility!? I thought arrive tells the other threads that this thread has arrived here and is waiting for them. But shouldn't memory be visible to the other threads? If not, why use an arrive here at all? Any example?
We can have performed memory changes before the barrier.arrive. Then, once the other threads are notified that this thread has arrived, those other threads assume that they can now continue with some work.
When those threads continue and load the data we wrote at the beginning, they may or may not see the changes.
Thank you very much for your answer. I am actually learning CUTLASS, and here it uses arrive.relaxed. I know that just before this we only finished initializing the template, so there is no memory inter-dependency. So why do we still have this arrive? I think we could just delete it!
My thinking is that using arrive.relaxed is useless… Any example?
I think the point of arrive/wait is that it can be overlapped with the computation between the arrive and the wait. But how much time does arrive/wait itself take? I mean, does this really give a performance improvement? Any doc or paper on this?
It is not about past memory transactions, but about future memory transactions.
For past memory transactions barrier.cluster.arrive.relaxed gives no guarantees, but future memory transactions of other threads will not have happened before the barrier, because the other threads doing those transactions wait until we are also at the barrier.
I'll give you an example (pseudo-code):
volatile int x = 0;
volatile int y = 0;
volatile int z = 0;

void threada()
{
    assert(z == 0); // only we write to z and we know its value
    z = 2;
    assert(z == 2); // only we write to z and we know its value
    assert(x == 0 && y == 0); // guaranteed true, but only because of the barrier, otherwise y could have changed in threadb()
    barrier.relaxed();
    x = 4;
}

void threadb()
{
    assert(z == 0 || z == 2); // no guarantee, could be either way
    assert(x == 0 && y == 0); // guaranteed true, but only because of the barrier, otherwise x could have changed in threada()
    barrier.relaxed();
    assert(z == 0 || z == 2); // no guarantee, could be either way; even here the barrier does not give memory visibility guarantees.
    // z could still be seen as 0, although z = 2 definitely was executed in threada() at this point in time!
    // barrier() without relaxed would have given us visibility of z == 2.
    y = 3;
}
Please look carefully at the asserts for x, y and z. The barrier guarantees that x and y are 0 before the barrier, because x and y are only written to after the barrier.
One uses arrive/wait for correctness.
Perhaps a simpler, less performant, but still correct version is possible without arrive/wait, by doing cruder synchronizations instead.
For optimal performance one overlaps as much as possible.
You could compile both versions and try it out.