Kernel Execution issues related to Shared Memory

Hi, all,

In CUDA, each multiprocessor has 16 KB of on-chip shared memory.

Is this 16 KB split among all the thread blocks scheduled to execute on that multiprocessor? E.g., on a Tesla C1060 I have 30 multiprocessors.
My kernel has 120 blocks, so 4 blocks are assigned to each multiprocessor. Does that mean I can allocate at most 4 KB of shared memory per block,
even though these 4 blocks won’t be executing instructions at the same time on one multiprocessor?
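
For concreteness, here is a minimal sketch of the kind of kernel I mean (the kernel name, array size, and launch configuration are just illustrative):

// Each block statically allocates 4 KB of shared memory:
// 1024 floats * 4 bytes = 4096 bytes per block.
__global__ void myKernel(float *out)
{
    __shared__ float buf[1024];   // 4 KB per block

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = tid;       // each thread fills one slot
    __syncthreads();
    out[tid] = buf[threadIdx.x];
}

// 120 blocks of 128 threads each, on a 30-SM Tesla C1060:
// myKernel<<<120, 128>>>(d_out);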

Another possibility is that one block can occupy all 16 KB of shared memory while it is computing on that multiprocessor. But if so, when this block stalls in the
middle of its computation (maybe waiting for a load from global memory) and is kicked off the multiprocessor so that
another block can be activated to execute instructions, will the current shared-memory values of the previous block be backed up?
That is, will these values be copied out first, the new block then overwrite all the values in shared memory, and the old values be copied back
when the previous block is activated again?

I didn’t find a clear explanation of this issue in the programming guide. One sentence in Section 4.1 says, “If there are not enough registers or shared memory available per
multiprocessor to process at least one block, the kernel will fail to launch.” Does this imply that as long as the shared memory can satisfy one block, the kernel can be launched,
like the second possibility I mentioned?
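
To test my reading of that sentence, I imagine a launch like this one would fail on a 16 KB device (the kernel name and sizes are illustrative; I haven’t actually verified this):

__global__ void bigSmemKernel(float *out)
{
    extern __shared__ float buf[];   // sized at launch time
    buf[threadIdx.x] = threadIdx.x;
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[threadIdx.x];
}

// Requesting 20 KB of dynamic shared memory per block, more than the
// 16 KB one multiprocessor has, so not even one block can fit:
// bigSmemKernel<<<120, 128, 20 * 1024>>>(d_out);
// cudaError_t err = cudaGetLastError();
// // expect a launch failure such as cudaErrorInvalidConfiguration
// // (the exact error code may vary by toolkit version)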

Thank you,

Not all blocks have to be resident at once. If, in your first example, each block uses only 4 KB of shared memory and less than 1/4 of the other per-SM resources (registers, threads), all 120 blocks could be resident at once and run in parallel (though that’s not guaranteed). If you use more than that, fewer blocks will run concurrently, and as a block finishes another block launches to take its spot.
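
Later CUDA toolkits (6.5 and up, so newer than the C1060 era) expose this calculation directly through the occupancy API; a minimal sketch, assuming a kernel named myKernel that uses 4 KB of dynamic shared memory and 128 threads per block:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *out)
{
    extern __shared__ float buf[];   // 4 KB, sized at launch
    buf[threadIdx.x] = out[blockIdx.x * blockDim.x + threadIdx.x];
}

int main()
{
    int blocksPerSM = 0;
    // Ask the runtime how many blocks of this kernel can be resident
    // on one multiprocessor at once, given 128 threads and 4 KB of
    // dynamic shared memory per block.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernel, 128, 4 * 1024);
    printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}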

The second case is impossible; once a block has been assigned to an SM, it does not leave the SM until it’s completed.

That is what happens in your first case: the 16 KB is partitioned among the blocks that are resident on the multiprocessor at the same time.

That isn’t what happens. Blocks aren’t paged in and out of running multiprocessors. They are scheduled and run until completion. Scheduling, memory access, instruction pipelining, and execution all happen at the warp or half-warp level, rather than at the block level. It is warps that are context-switched in and out of execution on any given multiprocessor, not whole blocks.
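
To make the warp granularity concrete, here is a trivial sketch in which each thread records which warp it belongs to (the kernel name is just illustrative; warpSize is 32 on all current hardware):

__global__ void warpIds(int *warpOf)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // A block's threads are grouped into warps of warpSize consecutive
    // threads. The hardware issues instructions warp by warp, and a
    // warp stalled on a global-memory load is switched out while other
    // resident warps keep executing.
    warpOf[tid] = threadIdx.x / warpSize;
}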

Thank you for the reply. That clears up my confusion.

Thank you for your clear explanation. I thought scheduling could happen at the block level.

From your reply, I gather that blocks are mapped to multiprocessors at the beginning of the kernel launch, as shown in Figure 4-1 of the programming guide:

if I have 30 multiprocessors, then blocks 0, 30, 60, … will be assigned to SM0 for execution; blocks 1, 31, 61, … will be on SM1.

Will this mapping stay fixed during kernel execution? If the blocks on SM1 finish first, will the hardware move the blocks on SM0 onto SM1?

Thank you,

Blocks are never ever moved for any reason. Once they’re assigned, that’s it, the end.

The ordering is not guaranteed, either. Blocks are handed out to multiprocessors as resources free up, not in a fixed round-robin pattern, so you can’t assume blocks 0, 30, 60, … will land on SM0.
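
If you want to see where your blocks actually land, you can read the %smid special register via inline PTX; a sketch (%smid is documented in the PTX ISA, but it’s intended only as a diagnostic, and the mapping can differ from run to run):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void whichSM(int *smOf)
{
    unsigned int smid;
    // %smid is a PTX special register holding the index of the
    // multiprocessor this thread is currently executing on.
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        smOf[blockIdx.x] = smid;
}

int main()
{
    const int nBlocks = 120;
    int *d_smOf, h_smOf[nBlocks];
    cudaMalloc(&d_smOf, nBlocks * sizeof(int));
    whichSM<<<nBlocks, 128>>>(d_smOf);
    cudaMemcpy(h_smOf, d_smOf, sizeof(h_smOf), cudaMemcpyDeviceToHost);
    for (int b = 0; b < nBlocks; ++b)
        printf("block %3d ran on SM %d\n", b, h_smOf[b]);
    cudaFree(d_smOf);
    return 0;
}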