Can cutlass::arch::NamedBarrier::sync() Fully Replace __syncthreads in Producer/Consumer Scenarios?

Can cutlass::arch::NamedBarrier::sync() replace __syncthreads() when the shared memory operations are confined to the producer threads alone, or to the consumer threads alone? My results are inconsistent:

  • Adding a printf makes the code work correctly, and removing it brings the failure back. That looks like a race, i.e. a missing sync somewhere.
  • Adding __syncthreads() sometimes resolves the problem, but at other times it causes a deadlock, especially when it is placed on the consumer side.
  • I’ve also tried adding NamedBarrier::sync() together with fence_view_async_shared(), but it didn’t help.
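
For concreteness, here is a minimal sketch of the pattern in question. It is not the actual kernel. The 128-thread group size, barrier id 1, and every function and variable name are assumptions made for illustration. It uses the raw PTX bar.sync instruction, which as far as I understand is what NamedBarrier::sync() wraps, so the sketch does not depend on CUTLASS.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kGroupThreads = 128;     // assumed size of each role's thread group
constexpr int kConsumerBarrierId = 1;  // hypothetical id; __syncthreads() uses barrier 0

// Barrier over `num_threads` threads only (must be a multiple of the warp size).
__device__ inline void named_barrier_sync(int barrier_id, int num_threads) {
    asm volatile("bar.sync %0, %1;" : : "r"(barrier_id), "r"(num_threads) : "memory");
}

// Launched with 2 * kGroupThreads threads: the first half plays the producer
// role, the second half the consumer role. Only consumers touch smem here.
__global__ void consumer_only_reverse(const int* in, int* out) {
    __shared__ int smem[kGroupThreads];
    const bool is_consumer = threadIdx.x >= kGroupThreads;
    if (is_consumer) {
        const int i = threadIdx.x - kGroupThreads;
        smem[i] = in[i];
        // Only the consumer threads reach this point. A __syncthreads() here
        // would sit in a branch the producer threads never enter, which the
        // CUDA programming guide describes as likely to hang.
        named_barrier_sync(kConsumerBarrierId, kGroupThreads);
        out[i] = smem[kGroupThreads - 1 - i];  // read an element another thread wrote
    }
}

// Hypothetical host helper: host_in must hold exactly kGroupThreads values.
std::vector<int> reverse_on_consumers(const std::vector<int>& host_in) {
    int* d_in = nullptr;
    int* d_out = nullptr;
    const size_t bytes = kGroupThreads * sizeof(int);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, host_in.data(), bytes, cudaMemcpyHostToDevice);
    consumer_only_reverse<<<1, 2 * kGroupThreads>>>(d_in, d_out);
    std::vector<int> host_out(kGroupThreads);
    cudaMemcpy(host_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernel
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

The sketch uses ordinary loads and stores only. If the same shared memory is also accessed through TMA or other asynchronous copies, ordering against those accesses is a separate concern that this sketch does not cover.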

What is the recommended synchronization approach in such scenarios, and how can synchronization issues like these be debugged effectively?