When a kernel has both producer and consumer threads, how should they be synchronized using CUTLASS’s barrier.sync/arrive? My understanding is that in barrier.arrive(a, b), a is the number of threads that must arrive at the barrier and b is the barrier ID. However, the producer and consumer thread counts usually differ, so it is not obvious which count to pass.
In FlashAttention3, I saw this example:
Here, a is the consumer thread count (256) plus the number of active producer threads (32), i.e. 288. However, I don’t understand why the two counts are added together.
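To make the question concrete, here is a minimal CUDA sketch of the pattern as I understand it. It is not the FlashAttention3 code: the kernel, the helper names (named_arrive, named_sync), the barrier ID and the 384-thread launch are all my own assumptions, and the helpers use the raw PTX bar.arrive / bar.sync instructions directly, which I believe is what CUTLASS's named barrier wraps. If the count is meant to be the total number of threads that reach the barrier through either call, that would account for 256 + 32 = 288, but I would like confirmation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// All names and sizes below are illustrative assumptions, not FlashAttention3 code.
constexpr unsigned kProducerThreads = 32;   // one active warp in the producer warpgroup
constexpr unsigned kConsumerThreads = 256;  // two consumer warpgroups of 128 threads
constexpr unsigned kBarrierThreads  = kProducerThreads + kConsumerThreads;  // 288
constexpr unsigned kBarrierId       = 1;    // ID 0 is the one __syncthreads() uses

// Signal arrival at a named barrier without waiting.
__device__ void named_arrive(unsigned num_threads, unsigned barrier_id) {
    asm volatile("bar.arrive %0, %1;" :: "r"(barrier_id), "r"(num_threads) : "memory");
}

// Arrive at a named barrier and wait until num_threads threads have arrived.
__device__ void named_sync(unsigned num_threads, unsigned barrier_id) {
    asm volatile("bar.sync %0, %1;" :: "r"(barrier_id), "r"(num_threads) : "memory");
}

__global__ void producer_consumer_demo(int* out) {
    __shared__ int slot;
    const unsigned tid = threadIdx.x;

    if (tid < 128) {                       // producer warpgroup
        if (tid < kProducerThreads) {      // only its first warp takes part
            if (tid == 0) slot = 42;       // "produce" a value
            __syncwarp();                  // reconverge so the warp arrives together
            __threadfence_block();         // conservative: publish the store first
            named_arrive(kBarrierThreads, kBarrierId);  // signal, do not wait
        }
    } else {                               // 256 consumer threads
        named_sync(kBarrierThreads, kBarrierId);        // wait for all 288 arrivals
        out[tid - 128] = slot;
    }
}

int main() {
    int* d_out = nullptr;
    cudaMalloc(&d_out, kConsumerThreads * sizeof(int));
    producer_consumer_demo<<<1, 384>>>(d_out);

    int h_out[kConsumerThreads];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    bool ok = true;
    for (unsigned i = 0; i < kConsumerThreads; ++i) ok = ok && (h_out[i] == 42);
    std::printf("consumers saw produced value: %s\n", ok ? "yes" : "no");
    return ok ? 0 : 1;
}
```

In this sketch both sides pass the same count, 288, even though only 32 threads call arrive and 256 call sync. Passing 256 on one side and 32 on the other is the alternative I would have expected from my reading of the parameter, which is the source of my confusion.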