Shared Memory questions

I’ve read the programming guide and the best practices guide but I still have some questions on shared memory.
In the programming guide, the following is stated:
“The total amount of shared memory required for a block is equal to the sum of the amount of statically allocated shared memory, the amount of dynamically allocated shared memory, and for devices of compute capability 1.x, the amount of shared memory used to pass the kernel’s arguments”.

Here I have two questions:

  1. Is dynamically allocated memory used only for extern arrays? If there are multiple extern arrays, does the amount of dynamic shared memory allocated at kernel launch have to be the sum of the sizes of all the extern arrays?
  2. When the kernel's arguments are passed through shared memory (max 256 B), are they copied into the shared memory of each multiprocessor?

One last question is related to a question in the best practices guide:
“Devices of compute capability 2.0 have the additional ability to multicast shared memory accesses.”

  1. What does multicasting mean? Does it refer to the fact that, on CC 2.0, if different threads access any bytes within the same 32-bit word, there is no bank conflict between those threads?

Thanks a lot for your help!

Yes, only extern __shared__ memory is dynamically allocated. Multiple arrays declared extern __shared__ overlap each other, so the kernel launch parameter for the size of dynamic shared memory should be the maximum of the array sizes, not their sum.

However, these overlapping arrays are of very limited use, so usually (almost universally) one will have only a single extern __shared__ array and manually segment it into multiple non-overlapping regions. See Appendix B.2.3 of the Programming Guide for the technique to do this.
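To make the segmentation technique concrete, here is a minimal sketch along the lines of Appendix B.2.3. The kernel name and sizes are illustrative, not from the guide; note that regions should be ordered so that each pointer remains properly aligned for its type.

```cuda
// One dynamically sized shared array for the whole block.
extern __shared__ float sdata[];

__global__ void segmented(int n)
{
    // Carve the single allocation into non-overlapping regions:
    float *a = sdata;             // first n floats
    int   *b = (int *)&a[n];      // next n ints, placed right after a
    // ... use a[] and b[] as two independent shared arrays ...
}

// Launch with the *sum* of the region sizes as the third parameter:
// segmented<<<grid, block, n * sizeof(float) + n * sizeof(int)>>>(n);
```

Because the segments are taken from one extern __shared__ array, the dynamic shared memory requested at launch is the sum of the region sizes, in contrast to the overlapping-arrays case above.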

Yes. Actually, they are copied for each block.

Yes, exactly. If multiple threads access the same 32-bit word, the content is multicast to them in one cycle, instead of using multiple cycles with a single transfer each.
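A small illustrative kernel (names and sizes are my own) showing the access pattern in question: every thread of the warp reads the same 32-bit word, which on CC 2.0 is multicast in a single transaction rather than serialized.

```cuda
__global__ void broadcastRead(const float *in, float *out)
{
    __shared__ float s[32];

    // Each thread stages one element into shared memory.
    s[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    // All threads read the same 32-bit word s[0]: no bank conflict,
    // the value is multicast to the whole warp in one transaction.
    out[threadIdx.x] = s[0];
}
```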


I’m struggling to understand what use overlapping dynamic arrays offer in any HPC application. My instinct is that they are of zero use. Maybe someone could enlighten me.

al


Two arrays that are not in use at the same time. Of course you may still consider that as zero use, since this can equally well be modeled using a union.
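As a sketch of that union-like behavior (phase names are hypothetical, and note that newer CUDA toolkits may warn about redeclaring extern __shared__ with different types):

```cuda
// Both extern __shared__ arrays start at the same address — they alias,
// much like the members of a union.
extern __shared__ float fbuf[];
extern __shared__ int   ibuf[];   // ibuf overlaps fbuf

__global__ void twoPhases(int n)
{
    // Phase 1: use the bytes as n floats via fbuf ...
    __syncthreads();
    // Phase 2: reuse the same bytes as n ints via ibuf ...
}

// Launch with max(n * sizeof(float), n * sizeof(int)) bytes of
// dynamic shared memory — the maximum, not the sum.
```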

My personal suspicion is that no other meaningful semantics for this construct could be found, so the CUDA implementers left it at what was easiest for them to implement.
