In CUTLASS, I noticed load_tail and producer_tail, which aim to prevent early exit. Does this mean that an entire cluster must finish simultaneously? Why do I find that this affects subsequent __syncthreads() calls?

The previous version of my code was:
while(…){
…
__syncthreads();
…
}
load_tail
---------->>>>>>>>> It loops forever.
Now my code is:
while(…){
…
load_tail
__syncthreads();
…
}
---------->>>>>>>>> It works well.
Why does moving load_tail inside the loop, before __syncthreads(), make the difference?