Lots of barrier operations have a .relaxed option; see the PTX ISA User Guide:
The optional .sem qualifier specifies a memory synchronizing effect as described in the Memory Consistency Model. If the .sem qualifier is absent, .release is assumed by default.
or
The optional .relaxed qualifier on barrier.cluster.arrive specifies that there are no memory ordering and visibility guarantees provided for the memory accesses performed prior to barrier.cluster.arrive.
Visibility!? I thought arrive tells the other threads that this thread has arrived here and is waiting for them. But shouldn't memory be visible to the other threads? If not, why use an arrive here at all? Any example?
We can have performed memory changes before the barrier.arrive. Then, once the other threads are notified that this thread has arrived, those other threads assume that they can now continue with some work.
When those threads continue and load the data we wrote at the beginning, they may or may not see the changes.
Thank you very much for your answer. I am actually learning CUTLASS, and here it uses arrive.relaxed. I know that just before this we only finished initializing the template, so there is no memory inter-dependency. So why do we still have this arrive? I think we could just delete it!
My thinking is that using arrive.relaxed is useless… Any example?
I think the point of arrive/wait is that it can be overlapped with the computation between the arrive and the wait. But how much time does arrive/wait itself take? I mean, does this really give a performance improvement? Any doc or paper on this?
It is not about past memory transactions, but about future memory transactions.
For past memory transactions barrier.cluster.arrive.relaxed gives no guarantees, but future memory transactions of other threads will not have happened before the barrier, because the other threads doing those transactions wait until we are also at the barrier.
I'll give you an example (pseudo-code):
volatile int x = 0;
volatile int y = 0;
volatile int z = 0;

void threada()
{
    assert(z == 0); // only we write to z and we know its value
    z = 2;
    assert(z == 2); // only we write to z and we know its value
    assert(x == 0 && y == 0); // guaranteed true, but only because of the barrier, otherwise y could have changed in threadb()
    barrier.relaxed();
    x = 4;
}

void threadb()
{
    assert(z == 0 || z == 2); // no guarantee, could be either way
    assert(x == 0 && y == 0); // guaranteed true, but only because of the barrier, otherwise x could have changed in threada()
    barrier.relaxed();
    assert(z == 0 || z == 2); // no guarantee, could be either way; even here the barrier does not give memory visibility guarantees.
    // z could still be seen as 0, although z = 2 definitely was executed in threada() at this point in time!
    // barrier() without relaxed would have given us visibility of z == 2.
    y = 3;
}
Please look carefully at the asserts for x, y and z. The barrier guarantees that x and y are 0 before the barrier, because x and y are only written to after the barrier.
One uses arrive/wait for correctness.
Perhaps a simpler, less performant, but still correct version is possible without arrive/wait, by doing cruder synchronizations instead.
For optimal performance one overlaps as much as possible.
You could compile both versions and try it out.