Let’s say I have n running blocks. Some thread in each block copies data from shared memory to global memory. I need a mechanism that guarantees that every block sees the updated (correct) contents of global memory once all of these copies are done. The classic __threadfence example, the “threadfence reduction” sample, makes the last block to finish sum the results of all the other blocks, which is not what I want: I need all blocks to read the updated data, not just one. I would also rather not split the work into two successive kernel launches. How can I do this? Thanks.
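To make the pattern concrete, here is a minimal sketch of the behaviour I am after. It assumes a grid-wide barrier is acceptable, here `grid.sync()` from cooperative groups, and that all blocks are resident on the device at the same time, on a GPU that supports cooperative launch (compute capability 6.0 or later). The kernel name `publishThenRead`, the buffers `partial` and `total`, and the sizes are made up for illustration only; this is not taken from the threadfence sample.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: each block publishes one value from shared memory,
// then every block reads what all the blocks published.
__global__ void publishThenRead(int *partial, int *total)
{
    __shared__ int s_val;
    cg::grid_group grid = cg::this_grid();

    if (threadIdx.x == 0) {
        s_val = blockIdx.x + 1;        // stand-in for the real per-block result
        partial[blockIdx.x] = s_val;   // shared -> global copy
    }

    // Grid-wide barrier: no thread continues until every thread of every
    // block has arrived, and the global writes made before it are visible.
    grid.sync();

    if (threadIdx.x == 0) {
        int sum = 0;
        for (unsigned b = 0; b < gridDim.x; ++b)
            sum += partial[b];         // read the other blocks' values
        total[blockIdx.x] = sum;
    }
}

int main()
{
    const int blocks = 4, threads = 64;  // small enough to be co-resident

    int supported = 0;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, 0);
    if (!supported) {
        printf("cooperative launch not supported on this device\n");
        return 0;
    }

    int *partial = nullptr, *total = nullptr;
    cudaMalloc(&partial, blocks * sizeof(int));
    cudaMalloc(&total, blocks * sizeof(int));

    // grid.sync() is only valid when the kernel is launched cooperatively.
    void *args[] = { &partial, &total };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void *)publishThenRead, dim3(blocks), dim3(threads), args, 0, 0);
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaDeviceSynchronize();

    int h_total[blocks];
    cudaMemcpy(h_total, total, sizeof(h_total), cudaMemcpyDeviceToHost);

    // Every block should have seen all four values: 1 + 2 + 3 + 4 = 10.
    for (int b = 0; b < blocks; ++b)
        printf("block %d saw sum %d\n", b, h_total[b]);

    cudaFree(partial);
    cudaFree(total);
    return 0;
}
```

The restriction in this sketch is that the whole grid has to fit on the device at once, so n cannot be arbitrarily large. If that restriction is not acceptable, I am not sure what the alternative is short of a second kernel launch.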