Is it possible to use `cutlass::arch::NamedBarrier::sync()` to replace `__syncthreads` when performing shared memory operations solely within either the producer or the consumer? My results seem inconsistent:
- Adding a `printf` makes the code work correctly, but removing it causes issues. This suggests a missing synchronization somewhere.
- Sometimes adding `__syncthreads` resolves the problem, but other times it causes deadlocks, especially when used on the consumer side.
- I've also tried adding `NamedBarrier::sync()` along with `fence_view_async_shared`, but it didn't help.

What is the recommended synchronization approach in such scenarios? How should one debug such synchronization issues effectively?