On Hopper, can TMA broadcast data to multiple blocks?

202476410arsmart · July 23, 2024, 7:19am

I know that shared memory has a broadcast mechanism. But how about TMA? Within a cluster, can data be broadcast to multiple blocks?

Curefab · July 23, 2024, 10:50am

For TMA, a broadcast in the shared-memory sense makes no sense: the active threads of a warp run in lockstep, which is not true for threads residing in different blocks.

In theory (I have not tried it out), distributing the data in the following fashion could be fastest:

one block transmits to one other block
the first block and the other block transmit to 2 other blocks
the four blocks transmit to 4 other blocks
the eight blocks transmit to 8 other blocks
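
To make the doubling scheme above concrete, here is a minimal host-side sketch in plain C++ that only builds the schedule; it does not perform any TMA or distributed shared memory transfer. The names `Transfer` and `doublingSchedule` are made up for illustration, and block 0 is assumed to be the originating block.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// One transfer: block `first` writes the data into block `second`.
using Transfer = std::pair<int, int>;

// Hypothetical helper: builds the doubling schedule.
// In each round, every block that already holds the data sends it to
// one block that does not hold it yet. Block 0 is assumed to be the source.
std::vector<std::vector<Transfer>> doublingSchedule(int numBlocks) {
    assert(numBlocks > 0);
    std::vector<std::vector<Transfer>> rounds;
    // `have` = number of blocks holding the data at the start of the round.
    for (int have = 1; have < numBlocks; have *= 2) {
        std::vector<Transfer> round;
        for (int src = 0; src < have && src + have < numBlocks; ++src) {
            round.emplace_back(src, src + have);
        }
        rounds.push_back(round);
    }
    return rounds;
}
```

For a cluster of 16 blocks, `doublingSchedule(16)` yields 4 rounds with 1, 2, 4 and 8 transfers, so 15 transfers in total. Each round boundary would need a synchronization between the blocks involved, which is where the extra latency comes from.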

On the other hand, you probably increase latency by synchronization, so transmitting in a loop from the originating block to all other blocks may be the fastest option after all.

I think it is more important to balance the number of inter-block reads and writes of all blocks and optimize bandwidth instead of latency.

So, if the data originates in one block only:
Each time, transmit to one of the e.g. 15 other blocks (changing this block in a round-robin fashion).
This other block then forwards the data to all remaining blocks in a loop.
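
As a rough illustration of why this balances the traffic, the following host-side C++ sketch counts outgoing writes per block for the relay scheme. It is a simulation only, under the assumptions that block 0 is the source, the data is split into chunks, and the relay for chunk `c` is block `1 + c % (numBlocks - 1)`; the name `relayWriteCounts` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: counts outgoing chunk writes per block when
// block 0 hands each chunk to a round-robin relay block, and the relay
// forwards the chunk to every block except the source and itself.
std::vector<int> relayWriteCounts(int numBlocks, int numChunks) {
    assert(numBlocks >= 2 && numChunks >= 0);
    std::vector<int> writes(numBlocks, 0);
    for (int chunk = 0; chunk < numChunks; ++chunk) {
        int relay = 1 + chunk % (numBlocks - 1);  // round-robin over blocks 1..numBlocks-1
        writes[0] += 1;                           // source -> relay
        writes[relay] += numBlocks - 2;           // relay -> all remaining blocks
    }
    return writes;
}
```

With 16 blocks and 15 chunks, the source issues 15 writes and every other block issues 14, whereas a plain loop from the source to all other blocks would put all 15 * 15 = 225 writes on the source.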
