Shared memory per block vs. shared memory of a multiprocessor

Hello,
I’ve just started CUDA development. There’s one thing that isn’t clear to me. When running e.g. deviceQuery (included in the SDK), it reports:
Total amount of shared memory per block: 16384 bytes
Appendix A of the user guide says:
The amount of shared memory available per multiprocessor is 16 KB

What’s confusing me is the relation between blocks and multiprocessors. As I understand it, each block runs on a multiprocessor, but multiple blocks can run on one multiprocessor. Does this mean that all the blocks on a multiprocessor have to share its 16 KB of memory, or is the shared memory swapped out to e.g. global memory, so that each block can always access the full 16 KB of shared memory?

Assuming they have to share the shared memory: if I run a kernel with the maximum number of blocks (65536), the blocks will be divided over the 16 multiprocessors, meaning 65536 / 16 = 4096 blocks run on each multiprocessor. Does this mean that each block can allocate at most 16384 / 4096 = 4 bytes of shared memory?

Thanks,
Michiel

The runtime looks at how much shared memory your kernel requires, and will run more than one block per multiprocessor if there is enough shared memory to do so. If each block requires only 5 kB of shared memory, then up to 3 blocks can run simultaneously per multiprocessor (leaving some room for kernel parameters, which are also loaded into shared memory). If a block uses 15 kB of shared memory, then the scheduler will run only 1 block per multiprocessor. Blocks that cannot be scheduled immediately run after the first blocks finish. So if you request 65536 blocks, they will all run, but not all at the same time.

Running multiple blocks per multiprocessor helps mitigate some of the global memory latency, so it is good to keep shared memory usage low if possible, but it is not required.

Does this apply to the register file as well, i.e. must two blocks each use fewer than 8192/2 registers in order to be launched simultaneously?

/Lars

Basically, yes.

Shared memory and registers are partitioned among the threads of all concurrent blocks. So decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently on the same SM.