Does global memory have some kind of broadcast mechanism?

bit_mapper · November 2, 2011, 1:18am

Suppose I have multiple thread blocks all doing the same thing, say traversing from A[0] to A[1000] in a coalesced way. Once the kernel is launched, thread 0 of every block reads A[0] from global memory, thread 1 of every block reads A[1], and so on. Are these accesses to the same address serialized, or is the value broadcast to all the threads that request it?

Thanks.

tera · November 2, 2011, 1:52am

Devices of compute capability 1.2 and later have a broadcast mechanism that sends the data to all threads of a (half-)warp in a single memory transaction. Devices of CC 2.0 and later also have L1 and L2 caches, which ideally reduce global memory traffic so that each datum is read just once for all blocks running in parallel. In practice the blocks will probably drift out of step somewhere along the way, so the data may still be read multiple times.

bit_mapper · November 2, 2011, 9:20pm

I’m using a GTX 480. If I want all threads with the same local index in different blocks to read the same datum, do I need to specify and manage the broadcast explicitly, or is it applied automatically without my doing anything?

tera · November 2, 2011, 11:51pm

You don’t need to specify anything (and in fact you can’t).
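To make the access pattern from the question concrete, here is a minimal sketch in which every block walks the same array with ordinary indexed loads. The names `sum_shared_input` and `broadcast_demo`, the all-ones input, and the per-block sum are assumptions made up for illustration and do not come from the thread. The code only shows that a plain load is all the kernel has to contain; it does not measure whether the loads are actually served from cache.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every block traverses the same array A.
// Thread t of each block reads A[t], A[t + blockDim.x], ..., so the threads
// of a warp touch consecutive addresses and all blocks touch the same ones.
__global__ void sum_shared_input(const float* A, int n, float* block_sums)
{
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        acc += A[i];  // plain global load; there is no broadcast hint to give
    }
    // Float atomicAdd requires compute capability 2.0 or later.
    atomicAdd(&block_sums[blockIdx.x], acc);
}

// Hypothetical host helper: fills A with ones, launches the kernel and
// returns true if every block ended up with the same sum, namely n.
bool broadcast_demo(int n, int blocks, int threads)
{
    std::vector<float> h_A(n, 1.0f);
    std::vector<float> h_sums(blocks, 0.0f);
    float* d_A = nullptr;
    float* d_sums = nullptr;

    if (cudaMalloc(&d_A, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_sums, blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(d_A);
        return false;
    }
    cudaMemcpy(d_A, h_A.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_sums, 0, blocks * sizeof(float));

    sum_shared_input<<<blocks, threads>>>(d_A, n, d_sums);

    // This copy waits for the kernel and reports any launch or run error.
    cudaError_t err = cudaMemcpy(h_sums.data(), d_sums,
                                 blocks * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_sums);
    if (err != cudaSuccess) return false;

    for (int b = 0; b < blocks; ++b) {
        if (h_sums[b] != static_cast<float>(n)) return false;
    }
    return true;
}
```

Calling `broadcast_demo(1001, 8, 128)` from a `main` function would cover A[0] to A[1000] with eight blocks. How many of those loads actually reach DRAM depends on the hardware and on block scheduling, which is why the kernel source contains nothing related to broadcasting.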

bit_mapper · November 9, 2011, 7:16pm

Thank you tera!
