- In the FA3 store function, I observed the following process:
  - Data is stored from registers to shared memory.
  - A sync occurs.
  - Then, data is stored from shared memory to global memory.
- This sync is a NamedBarrier sync, but I noticed that no arrive operation is performed:
  - I searched for the corresponding barrier ID and confirmed that no arrive is associated with it.
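To make the pattern concrete, here is a minimal CUDA sketch of what I believe the store path boils down to. It is not FA3's actual code: the kernel `store_via_smem`, the helper `named_barrier_sync_only`, the barrier ID `kEpilogueBarrier`, and the thread count are hypothetical names and values chosen for illustration. Every participating thread both arrives at and waits on the barrier through a single `bar.sync`, so no separate arrive appears anywhere.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;        // assumed number of participating threads
constexpr int kEpilogueBarrier = 1;  // hypothetical ID; __syncthreads() uses barrier 0

// Hypothetical helper: each caller arrives at the named barrier AND waits
// until num_threads threads have reached it.
__device__ __forceinline__ void named_barrier_sync_only(int barrier_id, int num_threads) {
    asm volatile("bar.sync %0, %1;" :: "r"(barrier_id), "r"(num_threads) : "memory");
}

__global__ void store_via_smem(float* out) {
    __shared__ float smem[kThreads];
    float reg = static_cast<float>(threadIdx.x);  // stand-in for an accumulator value
    smem[threadIdx.x] = reg;                      // registers -> shared memory
    named_barrier_sync_only(kEpilogueBarrier, kThreads);
    // Read a slot written by a different thread, so the barrier is really needed.
    out[threadIdx.x] = smem[kThreads - 1 - threadIdx.x];  // shared -> global memory
}

// Host-side check: returns true if every element has the expected value.
bool run_store_via_smem() {
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, kThreads * sizeof(float)) != cudaSuccess) return false;
    store_via_smem<<<1, kThreads>>>(d_out);
    std::vector<float> h_out(kThreads);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, kThreads * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < kThreads; ++i) {
        if (h_out[i] != static_cast<float>(kThreads - 1 - i)) return false;
    }
    return true;
}
```

In this sketch the set of threads that write shared memory and the set that read it are the same, which is why a plain sync seems to be enough here.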
- This reminds me of `__syncthreads`, which translates to PTX as `bar.sync` and also doesn’t involve an explicit `arrive`.
- This raises the question:
  - Does this imply that `arrive` is unnecessary for such synchronization scenarios?
- However, I noticed that in other parts of FA3, `arrive` is used.
- Therefore, I’m curious:
  - What are the specific conditions or scenarios where `arrive` is required?
It seems that if we use a named barrier the same way as `__syncthreads`, there’s no need for `arrive`. `bar.arrive` is meant for use in WASP, isn’t it? Although using a barrier for WASP feels odd… doesn’t this forcibly require the producer and consumer to have the same participating threads?