Issue with inter-block communication


Let’s say I have n running blocks. Some thread from each block copies shared memory to global memory. I need a mechanism that guarantees that all blocks see the updated (correct) version of global memory after this copy. The classic example with __threadfence used in the “threadfence reduction” sample forces the last block to sum the results of all other blocks, which is not what I want. I would also rather not use two successive kernels for this. How can I do it? Thanks.

Your desire for inter-block communication runs counter to the fundamental principle of CUDA computation that blocks execute independently of each other, which is exactly what leads to the two solutions you discarded.

I would suggest either rethinking your use case in terms of the CUDA programming model, or picking another computing platform whose programming model provides the features you want. IMHO, there is no point hammering square pegs into round holes.

On a Pascal or Volta GPU with CUDA 9 you can use a cooperative grid launch. It’s part of cooperative groups. Feel free to google for more information about cooperative groups, or read the relevant section of the CUDA 9 programming guide.
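To illustrate, here is a minimal sketch of a cooperative grid launch, assuming a device that reports `cudaDevAttrCooperativeLaunch` support (compute capability 6.0+ with CUDA 9 or later). The kernel and variable names are made up for the example; `grid.sync()` is the grid-wide barrier that makes each block's global-memory writes visible to all other blocks, which is the guarantee asked about above.

```cuda
#include <cstdio>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: each block publishes a value to global memory,
// then the whole grid synchronizes so every block can safely read
// the other blocks' results.
__global__ void exchange(int *block_results)
{
    cg::grid_group grid = cg::this_grid();

    // One thread per block writes this block's result (here: its index).
    if (threadIdx.x == 0)
        block_results[blockIdx.x] = (int)blockIdx.x;

    // Grid-wide barrier: after this point, the writes above are
    // visible to every thread in every block.
    grid.sync();

    // Any block may now read all blocks' results; block 0 sums them.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int sum = 0;
        for (int b = 0; b < (int)gridDim.x; ++b)
            sum += block_results[b];
        printf("sum = %d\n", sum);
    }
}

int main()
{
    // Cooperative launch is only available on supporting devices.
    int supported = 0;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, 0);
    if (!supported) {
        printf("cooperative launch not supported on this device\n");
        return 0;
    }

    const int blocks = 4, threads = 64;
    int *d_results;
    cudaMalloc(&d_results, blocks * sizeof(int));

    // Cooperative kernels must be launched via cudaLaunchCooperativeKernel,
    // not the <<<...>>> syntax, and the grid must fit on the device at once.
    void *args[] = { &d_results };
    cudaLaunchCooperativeKernel((void *)exchange, blocks, threads, args);
    cudaDeviceSynchronize();

    cudaFree(d_results);
    return 0;
}
```

Note the constraint this approach brings: all blocks of a cooperative grid must be resident on the GPU simultaneously, so the grid size is limited by occupancy.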