Concurrent Kernel Execution and Context Switching Problem

Hello, I have some questions about concurrent kernel execution (CKE) and context switching on Fermi.

I have read the related topics in the forum. As I understand it, on the Fermi architecture:

  1. At most 8 blocks on 1 SM
  2. At most 16 concurrent kernels on 1 GPU
  3. Kernels only run in parallel once the first kernel does not occupy all SMs anymore

I hope these assumptions are correct. Here are my questions:

  1. My GPU is a GTS 450, which has 4 SMs.
    Suppose I launch 2 kernels in different streams, the first with 8 blocks and the second with 4 blocks.
    I assume the first kernel doesn’t occupy all the resources (maybe 4 blocks fill 1 SM), so the second kernel can run on the device concurrently because there are free resources on the GPU.

    So my question is: how are the blocks issued to each SM?
    Situation 1: first kernel blocks 1~4 on SM1 and blocks 5~8 on SM2, second kernel on SM3 and SM4.
    Situation 2: like round-robin scheduling, first kernel blocks 1,5 on SM1, blocks 2,6 on SM2, blocks 3,7 on SM3, blocks 4,8 on SM4, and second kernel block 1 on SM1, block 2 on SM2, etc.
    Or is it neither of the cases above?

  2. This question is about context switching on the GPU.
    The Fermi white paper (page 18) says that, like CPUs, GPUs support multitasking through the use of context switching, where each program receives a time slice of the processor’s resources.
    Now I have 2 kernels in different streams, but the first kernel has 1024 blocks, so it can easily occupy all SMs. After a time slice, will the first kernel be switched out so that the second kernel can execute (kernel-level context switch)?
    Or will the blocks of the first kernel be switched out in favor of blocks of the second kernel, even if the first kernel's blocks have not completed (block-level context switch)?

  3. This question is about CKE on the GPU.
    Again with 2 kernels executing concurrently: the first kernel has 8 blocks and the second has 16 blocks, and blocks 1~8 of the first kernel and blocks 1~4 of the second kernel are currently executing on the SMs.
    If some blocks of the first kernel complete, will blocks of the second kernel be issued immediately? Or does the GPU wait until all blocks on the SMs have completed?

    The other case: if the first kernel occupies all the SMs and some of its blocks complete, resources are freed that could hold blocks of the second kernel. Will blocks of the second kernel be issued immediately?

Thank you all.

Do you know what information is saved when a context switch occurs? How much memory is needed, and which memory?

On CC 2.x devices the compute work distributor will distribute all thread blocks from the first grid launch before distributing thread blocks from the second grid launch. This is not a requirement of the API. It is an observable behavior of CC 2.x devices.

CC >= 3.5 devices can pre-empt the current grid with higher priority grids that were launched either on a higher priority stream or child CDP launches. CDP thread blocks will pre-empt themselves if they are waiting on child work to complete.
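For readers who want to try the priority-stream case mentioned above, here is a minimal sketch using the runtime API's stream priority calls. The kernel `spin`, the grid sizes, and the iteration count are made up for illustration, and whether the small grid actually overtakes the large one depends on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical workload: loops on arithmetic so the grid stays resident for a while.
__global__ void spin(float *out, int iters) {
    float x = static_cast<float>(threadIdx.x);
    for (int i = 0; i < iters; ++i) {
        x = x * 1.0001f + 0.5f;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main() {
    // Numerically lower values mean higher priority.
    // Both values are 0 if the device does not support stream priorities.
    int least = 0, greatest = 0;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t low, high;
    cudaStreamCreateWithPriority(&low, cudaStreamNonBlocking, least);
    cudaStreamCreateWithPriority(&high, cudaStreamNonBlocking, greatest);

    const int threads = 128;
    const int big_blocks = 1024;   // assumed: enough blocks to fill every SM
    const int small_blocks = 4;    // assumed: a small, urgent grid
    float *d_big = nullptr, *d_small = nullptr;
    cudaMalloc(&d_big, big_blocks * threads * sizeof(float));
    cudaMalloc(&d_small, small_blocks * threads * sizeof(float));

    // The large grid is launched first on the low-priority stream.
    spin<<<big_blocks, threads, 0, low>>>(d_big, 1 << 16);
    // Blocks of this grid may be scheduled ahead of the remaining blocks above.
    spin<<<small_blocks, threads, 0, high>>>(d_small, 1 << 16);

    cudaError_t err = cudaDeviceSynchronize();
    printf("priority range: least=%d greatest=%d, status: %s\n",
           least, greatest, cudaGetErrorString(err));

    cudaFree(d_big);
    cudaFree(d_small);
    cudaStreamDestroy(low);
    cudaStreamDestroy(high);
    return err == cudaSuccess ? 0 : 1;
}
```

A timeline view in the profiler is the easiest way to see whether the high-priority grid finished before the low-priority one.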

The algorithm for work distribution is not documented. Using inline PTX you can read %smid and determine the algorithm. The CUDA API does not define the order of execution so you should make no assumptions.
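A minimal sketch of the %smid approach: each block records which SM it ran on, and the host prints the mapping. The kernel name `record_smid` and the 8-block grid are assumptions chosen to match the original question; the output reflects only what one device did on one run, not a guaranteed policy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the %smid special register: the ID of the SM executing this thread.
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Thread 0 of each block stores the SM ID for its block.
__global__ void record_smid(unsigned int *sm_of_block) {
    if (threadIdx.x == 0) {
        sm_of_block[blockIdx.x] = get_smid();
    }
}

int main() {
    const int num_blocks = 8;  // assumed grid size, as in the first question
    unsigned int h_sm[num_blocks];
    unsigned int *d_sm = nullptr;

    cudaMalloc(&d_sm, num_blocks * sizeof(unsigned int));
    record_smid<<<num_blocks, 32>>>(d_sm);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(h_sm, d_sm, sizeof(h_sm), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int b = 0; b < num_blocks; ++b) {
        printf("block %d ran on SM %u\n", b, h_sm[b]);
    }

    cudaFree(d_sm);
    return 0;
}
```

Because these blocks finish almost immediately, an SM may be reused before later blocks are distributed; a longer-running kernel gives a clearer picture of the initial placement.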

The Fermi white paper is referring to context switching between GPU contexts, for example switching between two CUDA contexts, or between a CUDA context and an OpenGL context. CC 2.0 devices can only pre-empt between grid launches or draw calls.

The compute work distributor will distribute work after the completion of each thread block if the SM has sufficient room to accept a new thread block. State changes such as cudaFuncSetCacheConfig can cause serialization.
You should review the CUDA occupancy calculator to determine the resource requirements of each kernel.
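Besides the occupancy calculator spreadsheet, newer toolkits (CUDA 6.5 and later, so newer than the Fermi-era tools discussed here) expose an occupancy query in the runtime API. A rough sketch, assuming a hypothetical `saxpy` kernel and a block size of 256:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel whose resource usage we want to inspect.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    const int block_size = 256;  // assumed launch configuration
    int blocks_per_sm = 0;

    // How many blocks of this kernel fit on one SM at this block size,
    // with no dynamic shared memory.
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, saxpy, block_size, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // A grid larger than this cannot be fully resident at once, so a second
    // kernel only gets SM resources as blocks of the first one retire.
    printf("SMs: %d, resident blocks per SM: %d, resident blocks total: %d\n",
           prop.multiProcessorCount, blocks_per_sm,
           prop.multiProcessorCount * blocks_per_sm);
    return 0;
}
```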

@ Greg @ NV

Hello Greg,

Do you know what information is saved when a context switch occurs (registers, program counter)? How much memory is allocated for saving a context, and in which memory is it stored?

Suppose I have a kernel whose execution time is 5 ms (for example).

Launch Kernel;
POP Ctx;

WAIT 1ms; // for example 1/5 of my kernel total exec
POP Ctx;

WAIT 1ms; // Now 2/5 of my kernel total exec
POP Ctx;

Could this code work? Is the context saved automatically?

Please help.


For CC < 3.5 context switching for compute happens only after the completion of a grid or memory copy. There is no ability to pre-empt a running grid.

For CC 3.5-5.* context switching for compute can occur during the execution of a grid but only at thread block boundaries. When a context switch is initiated all thread blocks allocated to SMs must complete before the context switch will progress. In this mode no user state needs to be saved. At the save point no thread blocks are executing so there is no need to save SM resources including registers, shared memory, program counters, warp state or local memory.

Context switching is completely transparent to CUDA.

Thank You for your reply.

So can I switch between 2 kernels, with one kernel per context?
Using context push and pop (cuCtxPushCurrent(), cuCtxPopCurrent()), can I do something like the code I wrote before?

Assuming CC 3.5:

PUSH Ctx0;
Launch Kernel0;
POP Ctx0;

Push ctx1;
Launch kernel1;
Pop ctx1;

PUSH Ctx0;
WAIT 1ms;// 4blocks K0 // for example 1/5 of my kernel total exec
POP Ctx0;

PUSH Ctx1;
WAIT 1ms; //4blocksK1// Now 2/5 of my kernel total exec
POP Ctx1;

Pop ctx0
Push ctx1
…; 4blocksK1
Pop ctx1

Destroy ctx1,ctx0



Suppose you have a kernel with 4 instructions. Is it possible to stop execution after 2 instructions have been executed?

As stated above, CC <= 5.* devices do not support instruction-level pre-emption. CC 3.5-5.* devices can support pre-emption between thread blocks, but all allocated thread blocks must complete before the context switch will complete.

Switching the active CUcontext on the CPU is independent of the order of execution by the GPU.

If you want finer grain execution then create short running thread blocks and launch small grids.
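One way to read this advice is to split a large job into several small grids issued back to back in a stream, so that other work gets a chance to run at each grid boundary. A minimal sketch follows; the kernel `scale_chunk`, the chunk size, and the data size are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Processes elements in [offset, end) only, so each launch is a short grid.
__global__ void scale_chunk(float *data, int offset, int end, float factor) {
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < end) {
        data[i] *= factor;
    }
}

int main() {
    const int n = 1 << 20;       // assumed problem size
    const int chunk = 1 << 14;   // assumed elements per small grid
    const int threads = 256;

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Launch many small grids instead of one large one. Each grid boundary
    // is a point where the hardware may schedule other pending work.
    for (int offset = 0; offset < n; offset += chunk) {
        int end = (offset + chunk < n) ? offset + chunk : n;
        int blocks = (end - offset + threads - 1) / threads;
        scale_chunk<<<blocks, threads, 0, stream>>>(d_data, offset, end, 2.0f);
    }

    cudaError_t err = cudaStreamSynchronize(stream);
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

The trade-off is extra launch overhead per chunk, so the chunk size is something to tune rather than a fixed recommendation.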


Thank you very much for your replies.

I have some more questions :)

I can “multitask” by using multiple streams (distributing kernels, or tasks, across different streams) in the same context.

Context switching is another way to multitask (apparent multitasking, not true multitasking).

  • Is there another way to multitask on GPUs?

  • Suppose I have a GPU with 16 compute units. Is it possible to keep one or more compute units “inactive”? I think not: when I look at the scheduling mechanism (by adding an inline assembly line that reads %smid), I see that blocks (say, a kernel with 16 blocks) are distributed across all SMs, even when 8 of the 16 SMs have enough resources (registers, shared memory…) to hold them all (2 active blocks per SM, for example).

  • Are the SFUs (special function units) used implicitly when I put a double underscore before transcendentals, e.g. __sinf(), __cosf()?

  • Can the block scheduling algorithm, and the choice of which SM a kernel runs on, be modified in PTX assembler source? With special functions or something else?

Thank you in advance Greg,

You can use CUDA Dynamic Parallelism or MPS server to get additional parallelism. MPS is the only solution for getting parallelism between two contexts/nodes. MPS is only available on CC 3.5 and above. I believe it is limited to Tesla products.

No. The CUDA API does not support this feature.

The __ transcendentals should map to the MUFU.* instructions. The other options are to use inline PTX, PTX, or one of the open-source assemblers. You can verify the __ transcendentals by using nvdisasm to display the SASS code. The CUDA profilers (Visual Profiler and Nsight VSE) support source-correlated experiments that will show the mapping of SASS to source.
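A small sketch for checking this yourself: the kernel below computes the same sine with `sinf` and with the `__sinf` intrinsic, so the two code paths can be compared in the disassembly. The kernel name `sines` and the input values are made up for illustration.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// sinf() is the accurate library version; __sinf() is the fast intrinsic.
__global__ void sines(const float *in, float *accurate, float *fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(in[i]);
        fast[i] = __sinf(in[i]);
    }
}

int main() {
    const int n = 256;
    float h_in[n], h_acc[n], h_fast[n];
    for (int i = 0; i < n; ++i) {
        h_in[i] = 0.01f * i;  // assumed test inputs
    }

    float *d_in, *d_acc, *d_fast;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_acc, n * sizeof(float));
    cudaMalloc(&d_fast, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    sines<<<1, n>>>(d_in, d_acc, d_fast, n);

    cudaMemcpy(h_acc, d_acc, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_fast, d_fast, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Report the largest difference between the two versions.
    float max_diff = 0.0f;
    for (int i = 0; i < n; ++i) {
        max_diff = fmaxf(max_diff, fabsf(h_acc[i] - h_fast[i]));
    }
    printf("max |sinf - __sinf| = %g\n", max_diff);

    cudaFree(d_in);
    cudaFree(d_acc);
    cudaFree(d_fast);
    return 0;
}
```

To look at the generated SASS, one option is to run `cuobjdump -sass` on the compiled executable and search for MUFU instructions in the `sines` kernel. Note that compiling with `-use_fast_math` makes `sinf` use the fast path as well, which would hide the difference.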



Thank you.

A CUDA context is a state of the GPU (variables, kernels, streams…). Pushing and popping the CUDA context suspends this state, and the GPU stops executing the grid at thread block granularity (assuming the compute capability allows it). Right or wrong? If so, only part of the data allocated for the computation would be processed by the compute units (only a few thread blocks would have been scheduled and have processed their data).

For me, the notion of a context on a GPU is the same as on a CPU (register state, program counter…). I understand that context switching operates at the grid or thread block level, depending on the compute capability.

So if I am wrong, how can I do what I want?

Thank you in advance

The CUDA API functions for setting the current device or pushing and popping a context simply set the CUDA context pointer in thread-local storage. These functions have no impact on GPU execution or work scheduling.
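The following sketch tries to make that point concrete: a kernel is launched, the context is popped on the host while the kernel may still be running, and the results are still complete afterwards. It mixes the driver API (for push/pop) with the runtime API (for the launch) on the device's primary context; the kernel `fill` and the sizes are assumptions for illustration, and the program needs to be linked against the driver library (`-lcuda`).

```cuda
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

// Hypothetical kernel: writes each element's index.
__global__ void fill(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = i;
    }
}

int main() {
    const int n = 1 << 20;

    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // The runtime API works on the primary context, so retain and push that one.
    CUcontext ctx;
    cuDevicePrimaryCtxRetain(&ctx, dev);
    cuCtxPushCurrent(ctx);

    int *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));
    fill<<<(n + 255) / 256, 256>>>(d_out, n);

    // Pop the context right after the launch. This only changes which context
    // is current for this host thread; the grid keeps running on the GPU.
    CUcontext popped = nullptr, current = nullptr;
    cuCtxPopCurrent(&popped);
    cuCtxGetCurrent(&current);
    printf("context current after pop: %s\n", current ? "yes" : "no");

    // Make the context current again and check that the kernel ran to completion.
    cuCtxPushCurrent(ctx);
    int *h_out = new int[n];
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != i) { ok = false; break; }
    }
    printf("kernel results complete: %s\n", ok ? "yes" : "no");

    delete[] h_out;
    cudaFree(d_out);
    cuCtxPopCurrent(&popped);
    cuDevicePrimaryCtxRelease(dev);
    return ok ? 0 : 1;
}
```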