Shared Memory questions

I’ve read the programming guide and the best practices guide but I still have some questions on shared memory.
In the programming guide, the following is stated:
“The total amount of shared memory required for a block is equal to the sum of the amount of statically allocated shared memory, the amount of dynamically allocated shared memory, and for devices of compute capability 1.x, the amount of shared memory used to pass the kernel’s arguments”.

Here I have two questions:

  1. Is dynamically allocated memory used only for extern arrays? If there are multiple extern arrays, does the amount of dynamic shared memory allocated at kernel launch have to be the sum of the sizes of all the extern arrays?
  2. When the kernel's arguments are passed through shared memory (max 256 B), are they copied into the shared memory of each multiprocessor?

One last question is related to a question in the best practices guide:
“Devices of compute capability 2.0 have the additional ability to multicast shared memory accesses.”

  1. What does multicasting mean? Does it refer to the fact that, on CC 2.0, if different threads access any bytes within the same 32-bit word, there is no bank conflict between those threads?

Thanks a lot for your help!

Yes, only extern __shared__ memory is dynamically allocated. Multiple arrays declared extern __shared__ overlap each other, so the kernel launch parameter for the size of dynamic shared memory should be the maximum of the array sizes, not their sum.

However, these overlapping arrays are of very limited use, so usually (almost universally) one will have only a single extern __shared__ array and manually segment it into multiple non-overlapping regions. See Appendix B.2.3 of the Programming Guide for the technique to do this.
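To make the segmentation technique concrete, here is a minimal sketch along the lines of Appendix B.2.3. The kernel name and sizes are illustrative, not from the guide; note that regions should be ordered so that each pointer remains properly aligned for its type.

```cuda
// One dynamically sized shared array for the whole block.
extern __shared__ float sdata[];

__global__ void segmented(int n)
{
    // Carve the single allocation into non-overlapping regions:
    float *a = sdata;             // first n floats
    int   *b = (int *)&a[n];      // next n ints, placed right after a
    // ... use a[] and b[] as two independent shared arrays ...
}

// Launch with the *sum* of the region sizes as the third parameter:
// segmented<<<grid, block, n * sizeof(float) + n * sizeof(int)>>>(n);
```

Because the segments are taken from one extern __shared__ array, the dynamic shared memory requested at launch is the sum of the region sizes, in contrast to the overlapping-arrays case above.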

Yes. Actually, they are copied for each block.

Yes, exactly. If multiple threads access the same 32-bit word, the content is multicast to them in one cycle, instead of using multiple cycles with a single transfer each.
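A small illustrative kernel (names and sizes are my own) showing the access pattern in question: every thread of the warp reads the same 32-bit word, which on CC 2.0 is multicast in a single transaction rather than serialized.

```cuda
__global__ void broadcastRead(const float *in, float *out)
{
    __shared__ float s[32];

    // Each thread stages one element into shared memory.
    s[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    // All threads read the same 32-bit word s[0]: no bank conflict,
    // the value is multicast to the whole warp in one transaction.
    out[threadIdx.x] = s[0];
}
```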


I’m struggling to understand what use overlapping dynamic arrays offer in any HPC application. My instinct is that they are of zero use. Maybe someone could enlighten me.

al


Two arrays that are not in use at the same time. Of course you may still consider that as zero use, since this can equally well be modeled using a union.
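As a sketch of that union-like behavior (phase names are hypothetical, and note that newer CUDA toolkits may warn about redeclaring extern __shared__ with different types):

```cuda
// Both extern __shared__ arrays start at the same address — they alias,
// much like the members of a union.
extern __shared__ float fbuf[];
extern __shared__ int   ibuf[];   // ibuf overlaps fbuf

__global__ void twoPhases(int n)
{
    // Phase 1: use the bytes as n floats via fbuf ...
    __syncthreads();
    // Phase 2: reuse the same bytes as n ints via ibuf ...
}

// Launch with max(n * sizeof(float), n * sizeof(int)) bytes of
// dynamic shared memory — the maximum, not the sum.
```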

My personal suspicion is that no other meaningful semantics for this construct could be found, so the CUDA implementers left it at what was easiest for them to implement.
