I know in shared memory, we have broadcast. But how about TMA? In a cluster, can data be broadcast to multiple blocks?
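A broadcast in the shared-memory sense makes no sense for TMA: the shared-memory broadcast works because the active threads of a warp run in lockstep, and that is not true for threads residing in different blocks. (For loads from global memory, PTX does offer cp.async.bulk.tensor with the .multicast::cluster modifier, which delivers one tile to the shared memory of several blocks of the cluster. The ideas below are about distributing data that already sits in one block.)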
It could be that distributing the data in the following fashion is fastest (in theory only; I have not tried it out):
- one block transmits to one other block
- these two blocks transmit to 2 further blocks
- the four blocks transmit to 4 further blocks
- the eight blocks transmit to 8 further blocks
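To make the doubling scheme above concrete, here is a minimal host-side sketch in C++ that only computes who sends to whom in each step. It is a model of the schedule, not device code; the names `Transfer` and `doubling_schedule` and the convention that the data starts in the block with rank 0 are my own assumptions.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// One transfer: the block with rank `first` writes into the block with rank `second`.
// (Hypothetical helper type, for illustration only.)
using Transfer = std::pair<int, int>;

// Returns, for each step, the transfers that could run concurrently.
// Assumption: the data starts in rank 0. At the start of a step, ranks
// 0 .. have-1 hold the data, and each of them sends to the rank that is
// `have` higher, if that rank exists in the cluster.
std::vector<std::vector<Transfer>> doubling_schedule(int num_blocks) {
    assert(num_blocks >= 1);
    std::vector<std::vector<Transfer>> steps;
    for (int have = 1; have < num_blocks; have *= 2) {
        std::vector<Transfer> step;
        for (int src = 0; src < have && src + have < num_blocks; ++src) {
            step.emplace_back(src, src + have);
        }
        steps.push_back(step);
    }
    return steps;
}
```

For a cluster of 16 blocks this gives 4 steps with 1, 2, 4 and 8 transfers. On the device, each step boundary would presumably need a cluster-wide synchronization so that a block only forwards data it has fully received.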
On the other hand, the synchronization needed between the steps probably adds latency, so transmitting in a loop from the originating block to all other blocks may be the fastest option after all.
I think it is more important to balance the number of inter-block reads and writes of all blocks and optimize bandwidth instead of latency.
So if the data originates in one block only:
Each time, the originating block transmits to just one of the other blocks (e.g. 15 of them), and that block is changed in a round-robin fashion.
The chosen block then forwards the data to all remaining blocks in a loop.
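As a rough illustration of why the round-robin relay balances the load, the following host-side C++ sketch counts how many inter-block writes each block issues. It is a model only, with assumptions of mine: the originating block has rank 0, the relay does not send back to the originator, and `relay_for` and `relay_send_counts` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Relay block for message `message_index`: ranks 1 .. num_blocks-1,
// rotated round-robin. Rank 0 is assumed to be the originating block.
int relay_for(int message_index, int num_blocks) {
    assert(num_blocks >= 2 && message_index >= 0);
    return 1 + message_index % (num_blocks - 1);
}

// Number of inter-block writes issued by each block when rank 0
// distributes `num_messages` messages through rotating relays.
std::vector<int> relay_send_counts(int num_messages, int num_blocks) {
    std::vector<int> sends(num_blocks, 0);
    for (int i = 0; i < num_messages; ++i) {
        int relay = relay_for(i, num_blocks);
        sends[0] += 1;                   // originator -> relay
        sends[relay] += num_blocks - 2;  // relay -> every block except originator and itself
    }
    return sends;
}
```

With 16 blocks and 15 messages, the originator issues 15 writes and every other block issues 14. In the plain loop from the originating block, it would issue 15 * 15 = 225 writes on its own while the other blocks issue none.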