I have a kernel with ~40 CTAs. Every CTA writes to global memory using atomic operations. After writing, each CTA waits for all the other CTAs to finish writing before it resumes processing.
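
To make the pattern concrete, here is a minimal sketch of "write atomically, then wait for every CTA" using CUDA C++ intrinsics (`atomicAdd`, `__threadfence`, `__syncthreads`) rather than hand-written PTX. The names `grid_barrier_once`, `write_then_wait`, and `run_barrier_demo` are hypothetical, the barrier is single-use (the counter is not reset), and the sketch assumes that all CTAs are resident on the GPU at the same time. If they are not, the spin loop can hang.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical single-use grid barrier. Assumes every CTA of the grid is
// co-resident; otherwise the spin loop below may never terminate.
__device__ void grid_barrier_once(unsigned int* arrived, unsigned int num_ctas) {
    __syncthreads();                  // every thread of this CTA has issued its writes
    if (threadIdx.x == 0) {
        __threadfence();              // order this CTA's earlier writes before the arrival signal
        atomicAdd(arrived, 1u);       // signal that this CTA has arrived
        // Poll with a volatile load until all CTAs have signalled.
        while (*((volatile unsigned int*)arrived) < num_ctas) { }
        __threadfence();              // order later reads after the observed arrivals
    }
    __syncthreads();                  // release the remaining threads of this CTA
}

__global__ void write_then_wait(unsigned int* total, unsigned int* arrived,
                                unsigned int* seen) {
    atomicAdd(total, 1u);             // phase 1: every thread writes atomically
    grid_barrier_once(arrived, gridDim.x);
    // Phase 2: after the barrier, each CTA should observe all writes.
    if (threadIdx.x == 0) seen[blockIdx.x] = atomicAdd(total, 0u);
}

// Hypothetical host helper: returns the smallest total observed by any CTA
// after the barrier, or 0 on a CUDA error. Assumes num_ctas > 0.
unsigned int run_barrier_demo(unsigned int num_ctas, unsigned int threads_per_cta) {
    unsigned int* counters = nullptr;  // counters[0] = total, counters[1] = arrived
    unsigned int* seen = nullptr;
    if (cudaMalloc(&counters, 2 * sizeof(unsigned int)) != cudaSuccess) return 0;
    if (cudaMalloc(&seen, num_ctas * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(counters);
        return 0;
    }
    cudaMemset(counters, 0, 2 * sizeof(unsigned int));
    cudaMemset(seen, 0, num_ctas * sizeof(unsigned int));

    write_then_wait<<<num_ctas, threads_per_cta>>>(counters, counters + 1, seen);

    std::vector<unsigned int> host(num_ctas, 0u);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(host.data(), seen, num_ctas * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(counters);
    cudaFree(seen);
    if (err != cudaSuccess) return 0;
    return *std::min_element(host.begin(), host.end());
}
```

Calling `run_barrier_demo(40, 128)` from `main` should return `40 * 128` if every CTA saw all writes after the barrier. Another option is `cooperative_groups::this_grid().sync()` together with a cooperative launch (`cudaLaunchCooperativeKernel`), which fails at launch time when the grid cannot be co-resident instead of hanging.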

I’d like to implement this with the ‘ and’ instructions (if there are any other options, that would be great too).

Any examples?

Are there any docs that explain ‘ and’ in more detail? The PTX documentation is a bit hard to understand.