Ideas on data transfer between blocks?

Hello All,

I was wondering if anyone had any ideas on a good way to communicate/transfer data between multiple blocks. I’ll describe a simple scenario of what I am trying to achieve:

The application I am creating is essentially a hierarchy, where the different blocks are different nodes. For a simple example, let's say I have 3 blocks (B0, B1, and B2). In my application, each block takes in 512 bytes of input data. B2 is the 'parent' node of blocks B0 and B1. So when I run my application, blocks B0 and B1 evaluate and concatenate their outputs (256 B + 256 B) to form the INPUT of block B2 (for a total of 512 bytes of input).

Now the problem is that the global memory one block writes as its OUTPUT is the global memory another block reads as its INPUT. Granted, in the 3-block case there is no problem: since I have more than three multiprocessors, all 3 blocks can run concurrently, and the parent's inputs are just always one iteration behind the children's, which is fine. However, since we can't say anything about block ORDERING (especially when I scale this up to larger numbers of blocks), this breaks down. And in fact I do see unexpected behavior, where a child block has only written some of its outputs before a parent starts reading them, so the inputs I see are different and unpredictable every run.

This brings me to the solutions I have thought about so far:

  1. Launch a separate kernel for each layer, completing the tree level by level. I have implemented this, but it definitely takes a performance hit (in my case, almost 3x!).

  2. I was thinking about using atomic operations, so that a block would need to acquire a lock before reading or writing the ENTIRE 512 bytes at once, thus keeping multiple blocks from reading or writing the buffer at the same time. Any thoughts on whether this is safe?

  3. Any other ideas? I have been looking through the forum posts for anyone who has done block communication/data transfer but haven’t found anything exactly like this yet…
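For what it's worth, here is a minimal sketch of what I mean by option 1. The kernel name, buffer layout, block/thread counts, and the placeholder computation are all just illustrative, not my real code:

```cuda
#include <cuda_runtime.h>

// One launch per layer; each block handles one node of that layer.
__global__ void eval_layer(const unsigned char* in, unsigned char* out)
{
    // Each node reads its own 512-byte input slice and writes a
    // 256-byte output slice; the real node evaluation goes here.
    const unsigned char* my_in  = in  + blockIdx.x * 512;
    unsigned char*       my_out = out + blockIdx.x * 256;
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        my_out[i] = my_in[2 * i];   // placeholder computation
}

void run_tree(unsigned char* d_in, unsigned char* d_mid, unsigned char* d_out)
{
    // Layer 0: B0 and B1 run as two blocks of a single launch, writing
    // their 256-byte outputs side by side into d_mid.
    eval_layer<<<2, 128>>>(d_in, d_mid);

    // Kernels on the same stream run in order, so B2 is guaranteed to
    // see the fully written 512-byte concatenation in d_mid.
    eval_layer<<<1, 128>>>(d_mid, d_out);
}
```

The ordering between launches on the same stream is what makes this correct, but it is also where my ~3x slowdown comes from: every layer pays launch overhead and a full drain of the device.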


You need to launch a separate kernel for each layer. Each block needs to be independent for the CUDA model to work, where any number of blocks (even 1) can be resident at a time and run in any order. Any sort of inter-block communication or synchronization violates this and can easily cause compatibility problems.

If you wanted to go ahead and try inter-block communication anyway, you should make sure the number of CUDA blocks is not more than the number of SMs, otherwise you are almost guaranteed deadlock. With more nodes than SMs, each CUDA block could execute multiple nodes in some sequence, with atomic operations to coordinate. You would also need __threadfence() to ensure that your outputs are fully flushed to memory before the dependent node is allowed to begin. Theoretically this can work, but this kind of inter-block synchronization is very hard to get right and is strongly discouraged!
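To make the warning concrete, here is a rough sketch of that single-launch pattern, using a per-node "ready" flag published after a __threadfence(). All names and the placeholder math are made up for illustration, and this assumes the grid fits on the device at once (blocks <= SMs), or the parent spins forever:

```cuda
#include <cuda_runtime.h>

__device__ volatile int ready[2];   // one flag per child, zero-initialized

__global__ void eval_tree(const unsigned char* in,
                          unsigned char* scratch,  // 512-byte shared staging area
                          unsigned char* out)
{
    int node = blockIdx.x;          // blocks 0,1 = children, block 2 = parent

    if (node < 2) {
        // Child: write its 256-byte output slice into the staging buffer.
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            scratch[node * 256 + i] = in[node * 512 + i];  // placeholder
        __syncthreads();
        __threadfence();            // flush all writes to global memory
        if (threadIdx.x == 0)
            ready[node] = 1;        // only then publish "done"
    } else {
        // Parent: spin until BOTH children have published their outputs.
        if (threadIdx.x == 0)
            while (ready[0] == 0 || ready[1] == 0) { /* busy-wait */ }
        __syncthreads();
        __threadfence();            // order the flag reads before the data reads
        for (int i = threadIdx.x; i < 512; i += blockDim.x)
            out[i] = scratch[i];    // placeholder: consume 512-byte input
    }
}
```

If the parent block is scheduled while a child block cannot get an SM, the spin loop never exits; that is the deadlock mentioned above, and it is why the separate-kernel-per-layer approach is the safe answer.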