Why does early exit affect a later __syncthreads()?

In CUTLASS, I noticed load_tail and producer_tail, which aim to prevent early exit. Does this mean that all blocks in a cluster must finish at the same time? And why does this seem to affect subsequent __syncthreads() calls?



The previous version of my code was:

while(…){

__syncthreads();

}
load_tail
---------->>>>>>>>> This loops forever

Now my code is:

while(…){

load_tail
__syncthreads();

}

---------->>>>>>>>>It works well.

Why??

If distributed shared memory is used, a block A must not exit early if there are other thread blocks B within the cluster that need to access the distributed shared memory of block A.
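To illustrate the rule, here is a minimal sketch that uses the cooperative groups cluster API rather than CUTLASS's load_tail. The kernel name dsmem_exchange, the host helper run_dsmem_exchange, and the cluster size of 2 are made up for this example. It assumes a GPU with compute capability 9.0 and compilation with nvcc -arch=sm_90. Each block reads one value from its peer's shared memory, and the final cluster.sync() is what keeps a block resident until its peer has finished reading.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: a cluster of 2 blocks, each reading the peer's shared memory.
__global__ void __cluster_dims__(2, 1, 1) dsmem_exchange(int *out)
{
    __shared__ int smem[1];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();

    if (threadIdx.x == 0) {
        smem[0] = 100 + static_cast<int>(rank);
    }

    // Every block in the cluster is running and has written its value.
    cluster.sync();

    // Map the peer block's shared memory into this block's address space.
    unsigned int peer = rank ^ 1u;
    int *remote = cluster.map_shared_rank(smem, peer);
    if (threadIdx.x == 0) {
        out[rank] = remote[0];
    }

    // Without this barrier, a block could exit while its peer is still
    // reading its shared memory, which is the early exit described above.
    cluster.sync();
}

// Hypothetical host helper: launches one cluster and copies the results back.
cudaError_t run_dsmem_exchange(int host_out[2])
{
    int *dev_out = nullptr;
    cudaError_t err = cudaMalloc(&dev_out, 2 * sizeof(int));
    if (err != cudaSuccess) {
        return err;
    }

    // The grid size (2 blocks) is a multiple of the compile-time cluster size.
    dsmem_exchange<<<2, 32>>>(dev_out);
    err = cudaGetLastError();
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernel to finish before copying.
        err = cudaMemcpy(host_out, dev_out, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(dev_out);
    return err;
}
```

In this sketch, block 0 should end up with the value written by block 1 and vice versa. The trailing cluster.sync() plays the role of an exit guard here; it is only an analogy for what load_tail is meant to achieve, not a description of how CUTLASS implements it.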


Thanks!
Still, it is strange that the version below loops forever:

while(…){
…
__syncthreads();
…
}
load_tail
---------->>>>>>>>> This loops forever

After all, load_tail is still called at the end to prevent early exit…