Does shared memory have "broadcast" behavior?

If thread0 and thread1 try to access the same location in shared memory at the same time, how many transactions are needed?

1 Like

I remember I read from somewhere that only 1 transaction is needed.

But the profiler indicates 2 transactions are needed, which is weird.

Yes, there is broadcast on current GPUs. Only one transaction should be needed, assuming the requested type is a naturally aligned 32-bit quantity.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-3-0

1 Like

I’m using the Turing GPU.

And the shared memory instruction is the 128-bit version (LDS.128). Will this make any difference? Should I change back to the 32-bit version?

so that is what I would call a vector load. When doing a vector load per thread, it’s possible that the overall transaction width would exceed the amount of data that the shared memory subsystem can serve up (basically 128 bytes per transaction), even if all the data is nicely grouped together. Therefore a vector load per thread might get broken into two or more transactions, even if there are a few instances of broadcasting.

I’m not sure why you are worried about this, and I don’t have any code so I can’t estimate the impact of the change, so I’m not in a position to advise that you do this any differently. The exact expected behavior cannot be defined in a forum discussion without knowing the exact pattern of addresses distributed across the warp. Even then I would probably test it out to see the actual machine behavior.

If you have two cases under consideration, and want to know which is better, I usually suggest careful bencharking.

1 Like

Really good advice. I will do benchmarking to compare their performance. Thanks a lot.

Lets say you do a shared load of 64 bits per thread (e.g. could be a double per thread, or a vector load of e.g. int2 per thread), and all the addresses are adjacent and aligned. For a LDS.64 instruction, then, the warp is requesting 256 bytes from shared memory. But shared memory (for all current architectures except Kepler) has a maximum bandwidth of 32 bits per bank per transaction. There are 32 banks so this is a maximum delivery payload of 128 bytes per transaction. Therefore the machine will automatically split up such a vector load into 2 transactions. In this particular example there is no loss of efficiency or imperfection in this approach. The machine is delivering data to your code at the maximum possible rate, and the fact that it takes 2 transactions is inevitable, and not a demerit of the way the code is written. In other words, you ought not to worry about that.

My guess would be if you are doing a LDS.128 warp wide, that there is some form of this that is causing the load operation to go from 1 transaction to 2. But again, without an exact address pattern warp-wide to analyze, no concrete conclusions can be drawn.

1 Like

Hi! I have an example here:
in shared memory, we have data like:
0, 1, 2, 3, 4, 5, 6, 7…127
128, 129, 130…255,


.
.
.
.
.
One way to read is to let threadIdx.x=0, 1, 2, 3…7 to read data 0, 1, 2, 3 at the same time, and threadIdx.x=8, 9, 10, 11, …15 to read data 4, 5, 6, 7. using LDS.128, (note all the data in shared memory is float32). And another way is to let threadIdx.x=0, 2, 4, 6, 8, 10, 12, 14 to read 0, 1, 2, 3. And threadIdx.x=1, 3, 5, 7, 9, 11, 13, 15 to read 4, 5, 6, 7…

My question is, why the later way has higher efficienty? Broadcast has any limitations?

I think your previous answer should be highly related to my example here…But I can not really get why they are different…I think if you explain my problem, it can also clearly explain the post topic!

Thank you!!!

I suggest providing an actual code example that someone could compile and run, as well as a definition of how you define and measure efficiency.

1 Like