Concurrent Kernel Execution and Context Switching Problem

Hello, I have some questions about CKE (concurrent kernel execution) and context switching on Fermi.

I have read related topics in the forum. According to the Fermi architecture:

  1. At most 8 blocks on 1 SM
  2. At most 16 concurrent kernels on 1 GPU
  3. Kernels only run in parallel once the first kernel does not occupy all SMs anymore

I hope these assumptions are not wrong. Here are my questions:

  1. My GPU is a GTS 450, which has 4 SMs.
    If I write 2 kernels in different streams, the first kernel with 8 blocks and the second kernel with 4 blocks,
    I assume that the first kernel doesn't occupy all the resources (maybe 4 blocks fill 1 SM), so the second kernel can launch on the device concurrently because there are still resources left on the GPU.

    So my question is: how are the blocks issued to each SM?
    Situation 1: first kernel blocks 1~4 on SM1 and blocks 5~8 on SM2, second kernel on SM3 and SM4.
    Situation 2: round-robin scheduling, first kernel blocks 1,5 on SM1, blocks 2,6 on SM2, blocks 3,7 on SM3, blocks 4,8 on SM4, and second kernel block 1 on SM1, block 2 on SM2, etc.
    Or is it neither of the cases above?

  2. This question is about context switching on the GPU.
    The Fermi white paper (page 18) says that, like CPUs, GPUs support multitasking through the use of context switching, where each program receives a time slice of the processor's resources.
    Now I have 2 kernels in different streams, but the first kernel has 1024 blocks, so it can easily occupy all SMs. After a time slice, will the first kernel be context-switched out so that the second kernel executes (kernel-level context switch)?
    Or will the blocks of the first kernel be context-switched out in favor of the blocks of the second kernel even if the first kernel's blocks have not completed (block-level context switch)?

  3. This question is about CKE on the GPU.
    Again with 2 kernels executing concurrently on the GPU: the first kernel has 8 blocks and the second has 16 blocks, and right now blocks 1~8 of the first kernel and blocks 1~4 of the second kernel are executing on the SMs.
    If some blocks of the first kernel complete, will blocks of the second kernel be issued immediately? Or only after all blocks currently on the SMs have completed?

    The other case: the first kernel occupies the SMs, and when some of its blocks complete there are freed resources to issue blocks of the second kernel. Will blocks of the second kernel be issued immediately?

Thank you all.

Hi,
Do you know what information is saved when you switch contexts? The amount of memory? Which memory?

On CC 2.x devices the compute work distributor will distribute all thread blocks from the first grid launch before distributing thread blocks from the second grid launch. This is not a requirement of the API. It is an observable behavior of CC 2.x devices.

CC >= 3.5 devices can pre-empt the current grid with higher priority grids that were launched either on a higher priority stream or as child CDP launches. CDP thread blocks will pre-empt themselves if they are waiting on child work to complete.

The algorithm for work distribution is not documented. Using inline PTX you can read %smid and determine the algorithm. The CUDA API does not define the order of execution so you should make no assumptions.
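
For anyone who wants to observe this, here is a minimal sketch (not from the thread; kernel and variable names are illustrative) that reads %smid from each block of two grids launched in different streams, roughly the setup from question 1:

// Minimal sketch: each block records the SM it ran on so the distribution
// order can be observed. Names are illustrative, not from any NVIDIA sample.
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned int get_smid()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void record_smid(unsigned int *out)
{
    if (threadIdx.x == 0)                 // one write per block is enough
        out[blockIdx.x] = get_smid();
}

int main()
{
    const int blocksA = 8, blocksB = 4;
    unsigned int *dA, *dB, hA[blocksA], hB[blocksB];
    cudaMalloc(&dA, blocksA * sizeof(unsigned int));
    cudaMalloc(&dB, blocksB * sizeof(unsigned int));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Two grids in different streams, as in question 1 above.
    record_smid<<<blocksA, 32, 0, s0>>>(dA);
    record_smid<<<blocksB, 32, 0, s1>>>(dB);

    cudaMemcpy(hA, dA, sizeof(hA), cudaMemcpyDeviceToHost);
    cudaMemcpy(hB, dB, sizeof(hB), cudaMemcpyDeviceToHost);

    for (int i = 0; i < blocksA; ++i) printf("kernel0 block %d -> SM %u\n", i, hA[i]);
    for (int i = 0; i < blocksB; ++i) printf("kernel1 block %d -> SM %u\n", i, hB[i]);

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFree(dA); cudaFree(dB);
    return 0;
}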

The Fermi white paper is referring to context switching between GPU contexts. For example, switching between two CUDA contexts or a CUDA context and OpenGL context. CC 2.0 devices can only pre-empt between grid launches or draw calls.

The compute work distributor will distribute work after the completion of each thread block if the SM has sufficient room to accept a new thread block. State changes such as cudaFuncSetCacheConfig can cause serialization.
You should review the CUDA occupancy calculator to determine the resource requirements of each kernel.
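
If you prefer to query this programmatically instead of using the spreadsheet, here is a hedged sketch (assumes CUDA 6.5 or later, where cudaOccupancyMaxActiveBlocksPerMultiprocessor exists; the kernel is a placeholder):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data)    // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

int main()
{
    const int blockSize = 256;
    int blocksPerSM = 0;
    // How many blocks of my_kernel can be resident per SM at this block size.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel,
                                                  blockSize, 0 /* dynamic smem */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%d resident blocks per SM x %d SMs = %d blocks in flight\n",
           blocksPerSM, prop.multiProcessorCount,
           blocksPerSM * prop.multiProcessorCount);
    return 0;
}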

@ Greg @ NV

Hello Greg,

Do you know what information is saved when you switch contexts (registers, program counter)? How much memory is allocated for saving a context, and in which memory is it stored?

Suppose I have a kernel whose execution time is 5 ms (for example).


PUSH Ctx;
Launch Kernel;
POP Ctx;

PUSH Ctx;
WAIT 1ms; // for example 1/5 of my kernel total exec
POP Ctx;

PUSH Ctx;
WAIT 1ms; // Now 2/5 of my kernel total exec
POP Ctx;


Could this code potentially work? Is the context saving done automatically?

Please Help

RamosPacos,

For CC < 3.5 context switching for compute happens only after the completion of a grid or memory copy. There is no ability to pre-empt a running grid.

For CC 3.5-5.* context switching for compute can occur during the execution of a grid but only at thread block boundaries. When a context switch is initiated all thread blocks allocated to SMs must complete before the context switch will progress. In this mode no user state needs to be saved. At the save point no thread blocks are executing so there is no need to save SM resources including registers, shared memory, program counters, warp state or local memory.

Context switching is completely transparent to CUDA.

Greg,
Thank You for your reply.

So can I switch between 2 kernels with one kernel per context?
Using PopCtx()/PushCtx(), can I do something like the code I wrote before?

Supposing CC 3.5:


PUSH Ctx0;
Launch Kernel0;
POP Ctx0;

PUSH Ctx1;
Launch Kernel1;
POP Ctx1;

PUSH Ctx0;
WAIT 1ms; // 4 blocks of K0 // for example 1/5 of my kernel total exec
POP Ctx0;

PUSH Ctx1;
WAIT 1ms; // 4 blocks of K1 // now 2/5 of my kernel total exec
POP Ctx1;

PUSH Ctx0;
...; // 4 blocks of K0
POP Ctx0;

PUSH Ctx1;
...; // 4 blocks of K1
POP Ctx1;

Destroy Ctx1, Ctx0;
End



Greg,

Suppose you have kernel code with 4 instructions. Is it possible to stop execution after 2 instructions have been performed?

As stated above, CC <= 5.* devices do not support instruction-level pre-emption. CC 3.5-5.* devices can support pre-emption between thread blocks, but all allocated thread blocks must complete before the context switch will complete.

Switching the active CUcontext on the CPU is independent of the order of execution by the GPU.

If you want finer-grained execution, then create short-running thread blocks and launch small grids.
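
A hedged sketch of that suggestion (kernel body and chunk size are made up for illustration): split the work into chunks and launch a small, short-running grid per chunk, so something else can be scheduled between the launches.

#include <cuda_runtime.h>

// Illustrative kernel: processes 'count' elements starting at 'offset'.
__global__ void process(float *data, int offset, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        data[offset + i] *= 2.0f;
}

// Many small grids instead of one long-running grid.
void run_in_chunks(float *d_data, int n)
{
    const int chunk = 4096;                        // arbitrary small grid size
    for (int off = 0; off < n; off += chunk) {
        int count = (n - off < chunk) ? (n - off) : chunk;
        process<<<(count + 255) / 256, 256>>>(d_data, off, count);
    }
    cudaDeviceSynchronize();
}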

Greg,

Thank you very much for your replies.

I have some more questions :)

I can "multitask" by using multiple streams (balancing kernels, or tasks, across different streams) in the same context.

Context switching is another way to multitask (apparent multitasking, not real multitasking).

  • Is there another way to multitask on GPUs?

  • Suppose I have a GPU with 16 compute units. Is it possible to keep 1 or more compute units "inactive"? I think not, because when I look at the scheduling mechanism (by adding an assembly line asm(%smid)), I see that it distributes blocks (suppose a kernel with 16 blocks) across all SMs even if there are enough resources (registers, shared memory, ...) in 8 of the 16 SMs (2 active blocks per SM, for example).

  • Is utilization of SFU resources (special function units) done implicitly when I use a double underscore before transcendentals, e.g. __sinf(), __cosf(), ...?

  • Can the scheduling algorithm for blocks, and the SM on which I want a kernel to run, be modified in PTX assembler source? With the use of special functions or something else?

Thank you in advance, Greg.

You can use CUDA Dynamic Parallelism or MPS server to get additional parallelism. MPS is the only solution for getting parallelism between two contexts/nodes. MPS is only available on CC 3.5 and above. I believe it is limited to Tesla products.
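
For reference, a minimal CDP sketch (assumes CC 3.5+ and compilation with -arch=sm_35 -rdc=true -lcudadevrt; kernel names are illustrative):

#include <cuda_runtime.h>

__global__ void child(int *data, int offset)
{
    data[offset + threadIdx.x] += 1;
}

__global__ void parent(int *data)
{
    // Each parent block launches its own child grid from the device.
    if (threadIdx.x == 0)
        child<<<1, 32>>>(data, blockIdx.x * 32);
}

int main()
{
    const int blocks = 4, elems = blocks * 32;
    int *d;
    cudaMalloc(&d, elems * sizeof(int));
    cudaMemset(d, 0, elems * sizeof(int));
    parent<<<blocks, 32>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}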

No. The CUDA API does not support this feature.

The __ transcendentals should map to the MUFU.* instructions. The other options are to use inline PTX, standalone PTX, or one of the open-source assemblers. You can verify the __ transcendentals by using nvdisasm to display the SASS code. The CUDA profilers (Visual Profiler and Nsight VSE) support source-correlated experiments that will show the mapping of SASS to source.
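
For example, a hedged sketch (file names are made up) of an intrinsic kernel you can dump with nvdisasm to look for the MUFU.* instructions:

// Build and inspect (illustrative commands):
//   nvcc -arch=sm_35 -cubin fast_math.cu -o fast_math.cubin
//   nvdisasm fast_math.cubin      (look for MUFU.SIN / MUFU.COS)
__global__ void fast_math(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(in[i]) + __cosf(in[i]);   // SFU intrinsics
}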


Greg,

Thank you.

A CUDA context is a state of the GPU (variables, kernels, streams, ...). Pushing and popping the CUDA context suspends this state, and the GPU stops executing the grid at the thread-block level (supposing the compute capability allows that). Right or wrong? So only a part of the data allocated for the computation will be processed by the compute units (only a few thread blocks have been scheduled and have processed data).

For me, the notion of a context on the GPU is the same as on a CPU (state of the registers, program counter, ...). I understand that context switching operates at grid or thread-block granularity depending on the compute capability.

So if I am wrong, how can I do what I want?

Thank you in advance

The CUDA API functions for setting the current device or pushing and popping a context simply set the CUDA context pointer in thread-local storage. These functions have no impact on GPU execution or work scheduling.
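
To make that concrete, here is a hedged driver-API sketch (error checking omitted, link with -lcuda): the push/pop calls only change which context the calling CPU thread targets; they do not suspend or pre-empt grids already running.

#include <cuda.h>

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext ctx0, ctx1, tmp;
    cuCtxCreate(&ctx0, 0, dev);      // ctx0 becomes current on this thread
    cuCtxPopCurrent(&tmp);
    cuCtxCreate(&ctx1, 0, dev);      // ctx1 becomes current on this thread
    cuCtxPopCurrent(&tmp);

    cuCtxPushCurrent(ctx0);
    // ... cuLaunchKernel(...) issued here would go into ctx0 ...
    cuCtxPopCurrent(&tmp);           // ctx0's grids keep running on the GPU

    cuCtxPushCurrent(ctx1);
    // ... work issued here would go into ctx1 ...
    cuCtxPopCurrent(&tmp);

    cuCtxDestroy(ctx1);
    cuCtxDestroy(ctx0);
    return 0;
}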