st.global.release.gpu and ld.global.acquire.gpu

I have a kernel with ~40 CTAs. Every CTA writes to global memory with atomic operations. After writing, each CTA waits for all the other CTAs to finish writing before it resumes processing.

I’d like to implement this wait with the ‘st.global.release.gpu’ and ‘ld.global.acquire.gpu’ instructions (if there are other options, that would be great too).
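To make the question concrete, here is a minimal sketch of the pattern I have in mind; it is a guess, not something I know to be the recommended approach. The names `two_phase`, `st_release_gpu`, `ld_acquire_gpu` and `run_two_phase_demo` are made up for illustration. It assumes sm_70 or newer, that all CTAs are resident on the GPU at the same time (otherwise the spin loop could hang), and that the flags are zeroed before launch and used only once. The qualifiers are written in the order the PTX ISA syntax lists them (`.release.gpu.global`).

```cuda
#include <cuda_runtime.h>

// Store with release semantics at GPU scope (assumes sm_70+ and that p
// points to global memory).
__device__ __forceinline__ void st_release_gpu(unsigned* p, unsigned v) {
    asm volatile("st.release.gpu.global.u32 [%0], %1;"
                 :: "l"(p), "r"(v) : "memory");
}

// Load with acquire semantics at GPU scope (same assumptions as above).
__device__ __forceinline__ unsigned ld_acquire_gpu(const unsigned* p) {
    unsigned v;
    asm volatile("ld.acquire.gpu.global.u32 %0, [%1];"
                 : "=r"(v) : "l"(p) : "memory");
    return v;
}

// Hypothetical kernel: phase 1 writes with atomics, then a one-shot
// grid-wide wait, then phase 2 checks what phase 1 produced.
__global__ void two_phase(unsigned* data, unsigned* flags, unsigned* errors) {
    // Phase 1: every thread of every CTA adds 1 to data[threadIdx.x].
    atomicAdd(&data[threadIdx.x], 1u);

    // Make sure the whole CTA has issued its writes, then publish one
    // "done" flag per CTA with a release store.
    __syncthreads();
    if (threadIdx.x == 0) st_release_gpu(&flags[blockIdx.x], 1u);

    // Wait for every CTA: thread i polls the flag of CTA i (strided in
    // case there are more CTAs than threads per CTA).
    for (unsigned i = threadIdx.x; i < gridDim.x; i += blockDim.x) {
        while (ld_acquire_gpu(&flags[i]) == 0u) { /* spin */ }
    }
    // Hand the result of the acquire loads to the rest of the CTA.
    __syncthreads();

    // Phase 2: each counter should now hold one contribution per CTA.
    if (data[threadIdx.x] != gridDim.x) atomicAdd(errors, 1u);
}

// Host helper: returns the number of threads that saw an incomplete
// phase 1 (0 is the hoped-for result), or -1 on a CUDA error.
int run_two_phase_demo() {
    const unsigned ctas = 40, threads = 128;
    const size_t n = threads + ctas + 1;  // data | flags | error counter
    unsigned* buf = nullptr;
    if (cudaMalloc(&buf, n * sizeof(unsigned)) != cudaSuccess) return -1;
    if (cudaMemset(buf, 0, n * sizeof(unsigned)) != cudaSuccess) {
        cudaFree(buf);
        return -1;
    }
    two_phase<<<ctas, threads>>>(buf, buf + threads, buf + threads + ctas);
    unsigned errors = ~0u;
    cudaError_t err = cudaMemcpy(&errors, buf + threads + ctas,
                                 sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(buf);
    return err == cudaSuccess ? static_cast<int>(errors) : -1;
}
```

My understanding is that the release store orders the CTA's earlier writes before the flag, and the acquire load orders the later reads after it, but I am not sure I have the details right. A cooperative launch with `cooperative_groups::this_grid().sync()` looks like another way to get both the wait and the co-residency guarantee.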

Any examples?

Are there any docs that explain ‘st.global.release.gpu’ and ‘ld.global.acquire.gpu’ in more detail? The PTX documentation is a bit hard to understand.