What about the case where multiple thread blocks all do the same thing, say each block traverses the array A from beginning to end in a coalesced way? When the kernel is launched, thread 0 of every block reads the same element of A from global memory, thread 1 of every block reads the next element, and so on. I'm not sure whether these accesses to the same address have to be serialized or are broadcast to all threads.
Devices of compute capability 1.2 and later have a broadcast mechanism that sends the data to all threads of a (half-)warp in one memory transaction. Devices of compute capability 2.0 and later also have L1 and L2 caches, which ideally reduce global memory traffic so that each datum is read just once for all blocks running in parallel (although in practice the blocks will probably drift out of step with each other somewhere along the way, so some data will be read multiple times).
I'm using a GTX 480. If I want all threads with the same local index in different blocks to read the same datum, do I need to explicitly specify and manage the broadcasting, or is it applied automatically so that I don't need to do anything?
You don't need to (and in fact cannot) specify anything.
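To make the access pattern from this thread concrete, here is a minimal sketch of a kernel in which thread t of every block reads the same elements of A. The kernel name `sumSharedInput`, the array sizes, and the launch configuration are all assumptions chosen for illustration; they are not taken from the original poster's code. Note that nothing in the code requests a broadcast or manages the caches: the loads are ordinary global memory reads.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every block walks the whole array A.
// Thread t of each block reads A[t], A[t + blockDim.x], ... so threads with
// the same local index in different blocks read exactly the same addresses,
// while neighbouring threads in a warp read consecutive (coalesced) addresses.
__global__ void sumSharedInput(const float* A, float* out, int n)
{
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += A[i];  // plain load; no broadcast-specific code
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

int main()
{
    const int n = 1024, threads = 256, blocks = 8;  // illustrative sizes
    std::vector<float> hA(n, 1.0f);
    std::vector<float> hOut(blocks * threads, 0.0f);

    float *dA = nullptr, *dOut = nullptr;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dOut, hOut.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    sumSharedInput<<<blocks, threads>>>(dA, dOut, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Each thread adds n / threads elements, all equal to 1.0f.
    const float expected = static_cast<float>(n / threads);
    int mismatches = 0;
    for (float v : hOut)
        if (v != expected) ++mismatches;
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(dA);
    cudaFree(dOut);
    return mismatches == 0 ? 0 : 1;
}
```

Whether the repeated reads from different blocks are actually served from cache rather than from DRAM depends on the device and on how closely the blocks stay in step, so a profiler is the way to check what happens on a given card.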