Shared memory: is it context switched?

I am somewhat confused about the CUDA architecture. If a thread/warp/block needs a slow memory access from device memory, the latency can be masked by executing threads from another block. This requires preemption of the current warp. But what about the shared memory? Is it part of the block context and preempted at the same time?

– Kuisma

Shared memory is divided between blocks, just as the registers are. If there is not enough shared memory (or registers) for an extra block at the same time on the same multiprocessor, that block won’t be scheduled for concurrent execution. These are known as “split resources” in simultaneous multithreading.

But wouldn’t this require offline scheduling, i.e. a precalculated schedule? With non-deterministic memory accesses causing preemption, this would not be possible…?

  1. That scheduling can easily be done offline. Based on the register usage and shared memory usage, you (or the driver) can easily determine how many BLOCKS can run concurrently on a single multiprocessor. This number is constant because those resource usages are constant for each block. See the formulas in the programming guide, or download the occupancy calculator to play with it.

  2. At the warp level, you need to imagine that after every single INSTRUCTION executed, another warp is likely to be swapped in. The latency is much, much lower with shared memory than with global memory, but the same interleaving occurs as a result. Heck, there is even interleaving at the register level, since values written to registers cannot be read for a few clocks (register read-after-write dependencies).
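The per-multiprocessor block count described in point 1 can be sketched as a minimum over the split resources. This is only a rough illustration: the struct names, the function name, and the limit values used with it are assumptions made for the example, and real hardware also rounds allocations to a granularity that this sketch ignores. The programming guide and the occupancy calculator remain the authoritative sources.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-multiprocessor limits (names are illustrative, not a CUDA API).
struct SmLimits {
    int registers;         // total registers on one multiprocessor
    int shared_mem_bytes;  // total shared memory on one multiprocessor
    int warp_slots;        // maximum resident warps
    int block_slots;       // maximum resident blocks
};

// Hypothetical description of what one block of the kernel consumes.
struct KernelUsage {
    int threads_per_block;
    int regs_per_thread;
    int shared_mem_per_block;  // bytes
};

// Number of blocks that fit on one multiprocessor at the same time.
// Simplified: no allocation granularity, no per-architecture special cases.
int resident_blocks_per_sm(const SmLimits& sm, const KernelUsage& k,
                           int warp_size = 32) {
    assert(k.threads_per_block > 0 && k.regs_per_thread > 0);
    int warps_per_block = (k.threads_per_block + warp_size - 1) / warp_size;
    int by_regs  = sm.registers / (k.regs_per_thread * k.threads_per_block);
    // A kernel that uses no shared memory is not limited by it.
    int by_smem  = k.shared_mem_per_block > 0
                       ? sm.shared_mem_bytes / k.shared_mem_per_block
                       : sm.block_slots;
    int by_warps = sm.warp_slots / warps_per_block;
    // The scarcest resource decides how many blocks are resident.
    return std::min({by_regs, by_smem, by_warps, sm.block_slots});
}
```

With placeholder limits of 8192 registers, 16384 bytes of shared memory, 24 warp slots and 8 block slots, a kernel using 128 threads per block, 10 registers per thread and 3072 bytes of shared memory per block is limited by shared memory. Because every input is fixed for the kernel, the result is the same for the whole launch, which is why no run-time rescheduling of resources is needed.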

There is no preemption in the sense that there is preemption in multitasking environments. There is no context switch. It’s more as if the hardware maintained several execution contexts simultaneously (in fact, as the resources are split, it maintains all the execution contexts; r0 is not the same register for thread0 as for thread1).

So the driver does static scheduling based on how many blocks can run at the same time on the GPU. As long as there are enough resources (registers + shared memory + “warp slots” + “block slots”), it just keeps adding blocks to the schedule. Time is shared between threads in a fine-grained multithreading fashion (you can easily find “understandable” papers about coarse-grained, fine-grained and simultaneous multithreading to better understand how time is shared in GPUs).

I don’t know if I’m answering your question. I think you are misunderstanding how multithreading is actually performed by hardware… but maybe I’m wrong. :blink:

I think I know what’s confusing me. :)

I must be wrong here. Memory reads can only be masked by the execution of threads already scheduled for the current multiprocessor, that is, very likely threads of the same block, right?

I got the (incorrect) impression that more blocks in the grid would help performance during device memory accesses, and that this would have required some kind of inter-block scheduler, context switches, etc. But blocks are more or less batch scheduled? Once assigned to a multiprocessor, one or more blocks will use it exclusively until they terminate?

Yes, you are right. I need to play around more.


Given the kernel properties (number of registers used, amount of shared memory used), it is determined how many blocks can execute on one multiprocessor at once, up to a maximum of 8. This number is static during the entire kernel execution, so the hardware knows that when a block ends, it can start a new one.

So more blocks in a grid can get you better performance, if the kernel properties allow more blocks to be executed on a multiprocessor in parallel.

Ok, I think I’m enlightened. Thank you folks. :)

Wumpus, so if I have 9 thread blocks overall for each multiprocessor to execute, but the shared memory usage allows only 5 thread blocks to run concurrently on one multiprocessor, then:

Question 1:

There would be no thread-block swapping and hence no context switching during the entire run time of the 5 thread blocks. Only when they have all terminated would the other 4 thread blocks get scheduled onto the multiprocessor. Is that correct?

Question 2:

What if one of the 5 thread blocks running on the multiprocessor terminates earlier than the others (for some reason)? Would it be possible for one of the remaining 4 thread blocks to get scheduled onto the multiprocessor as soon as one of the running thread blocks has terminated?

Many thanks,


I’m not sure what happens if some blocks take less time to compute than others. But I think it means that another block will get executed in its place. It’d be wasteful if the block had to wait for all the others to complete. Someone from NVIDIA would have to comment on this, though.
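To make the difference between the two possibilities in Questions 1 and 2 concrete, here is a toy model of one multiprocessor with a fixed number of block slots. It is not a description of what the hardware actually does, which this thread leaves unconfirmed; the function names and the block durations are made up for illustration. It only compares the total run time when a freed slot is refilled immediately with the run time when the next batch waits for the whole previous batch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// "Refill" policy: a pending block starts as soon as any resident block ends.
int makespan_refill(const std::vector<int>& durations, int slots) {
    assert(slots > 0);
    // Min-heap of finish times of the currently resident blocks.
    std::priority_queue<int, std::vector<int>, std::greater<int>> finish;
    int now = 0, makespan = 0;
    for (int d : durations) {
        if (static_cast<int>(finish.size()) == slots) {
            now = finish.top();  // wait for the earliest block to end
            finish.pop();
        }
        finish.push(now + d);
        makespan = std::max(makespan, now + d);
    }
    return makespan;
}

// "Batch" policy: the next group starts only when the whole previous group ended.
int makespan_batch(const std::vector<int>& durations, int slots) {
    assert(slots > 0);
    int total = 0;
    for (std::size_t i = 0; i < durations.size(); i += slots) {
        std::size_t end = std::min(durations.size(), i + slots);
        int longest = 0;
        for (std::size_t j = i; j < end; ++j)
            longest = std::max(longest, durations[j]);
        total += longest;  // a batch lasts as long as its slowest block
    }
    return total;
}
```

For 9 blocks on 5 slots with made-up durations {10, 2, 2, 2, 2, 8, 8, 8, 8}, the refill policy finishes at time 10 while the batch policy needs 18. When all blocks take equally long, both policies give the same result, so the question only matters when block run times differ.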