I found these comments previously, but they are from 2014:
So I am wondering where I can find the corresponding figures for recent NVIDIA GPUs. Thanks.
...
CUDA has multiple different levels of context switching.
Cost to do full GPU context switch is 25-50µs.
Cost to launch CUDA thread block is 100s of cycles.
Cost to launch CUDA warps is < 10 cycles.
Cost to switch between warps allocated to a warp scheduler is 0 cycles and can happen every cycle.
The cost of CDP software pre-emption on CC >= 3.5 is higher and varies with the GPU workload.
Where did you find that data? It looks plausible to me, but I don’t think it is from NVIDIA documentation, so it is hard to assess how reliable it is.
If it’s from a paper by a research group that determined these numbers by microbenchmarking, look for newer papers citing that paper, or for other papers from the same research group.
The basics of NVIDIA GPU operation haven’t changed much since 2014, so to first order the numbers are likely still correct assuming they were correct in 2014.
The following publication might be a good jump-off point to find relevant microbenchmarking studies:
Vasily Volkov, “A microbenchmark to study GPU performance models”. In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2018, pp. 421–422
Hi, I am wondering why “Cost to switch between warps allocated to a warp scheduler is 0 cycles and can happen every cycle.” When we do context switching, the least we must do is set the right PC (or maintain a stack of PCs or something; I don’t know how it is implemented) and switch to the thread’s relevant register set. Does this cost nothing?
Context switching in a GPU does not involve physically moving data in and out of a small register file: courtesy of a register file comprising 65,536 4-byte registers per SM (for a total of 256 KB of register storage per SM), there is no need to share registers between “live” threads.
For example, on a single SM we could have 32 “live” warps of 32 threads each, with each thread using 64 registers. The 32 warps could belong to four thread blocks each comprising eight warps (=256 threads), or some other suitable thread-block arrangement.
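To make the arithmetic in that example concrete, here is a small sketch (plain Python, using only the numbers given above) checking that 32 live warps at 64 registers per thread exactly fill a 64K-register file:

```python
# Register-file budget on one SM, using the figures from the example above.
REGISTERS_PER_SM = 65536   # 64K 4-byte registers per SM
BYTES_PER_REGISTER = 4

warps = 32                 # "live" warps resident on the SM
threads_per_warp = 32
regs_per_thread = 64

regs_used = warps * threads_per_warp * regs_per_thread
print(regs_used)           # 65536 -> exactly fills the register file
print(REGISTERS_PER_SM * BYTES_PER_REGISTER // 1024)  # 256 (KB per SM)
assert regs_used <= REGISTERS_PER_SM
```

Every live thread keeps its registers resident for its whole lifetime, which is why nothing needs to be saved or restored on a warp switch.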
Yes, there is an adjustment of some sort (e.g. selecting a base register pointer/index within the file). Presumably this happens when the next instruction is selected, and it evidently “costs nothing”. We could imagine, for example, that when the warp scheduler selects the next instruction, it also supplies the register-file index that the instruction will use for its register set. This index can be derived from the warp ID itself and delivered to the functional unit processing the instruction.
I’m not saying that’s exactly how it works, but as njuffa states, with all live registers always physically present in the register file, the amount of additional “work” could be small enough to be not significant.
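As a purely hypothetical illustration of that idea (this is not how the hardware is documented to work): if each warp’s registers occupy a contiguous slice of the file and all warps get equal allocations, the base index is just a trivial function of the warp ID, so “switching” register sets amounts to selecting a different constant:

```python
THREADS_PER_WARP = 32

def register_base(warp_id, regs_per_thread):
    """Hypothetical base index of a warp's register slice, assuming
    warps are packed contiguously with equal per-thread allocations."""
    return warp_id * regs_per_thread * THREADS_PER_WARP

# With 64 registers per thread: warp 0 starts at index 0,
# warp 1 at 2048, warp 2 at 4096, ...
print([register_base(w, 64) for w in range(3)])  # [0, 2048, 4096]
```

Computing (or looking up) such a base index is a tiny amount of combinational work, which is consistent with it being hidden inside instruction selection.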
The more astounding feat is the scheduler itself: finding out so quickly which warp to execute next. Some of the logic is moved to compile time (for fixed instruction latencies), but some of it is dynamic, e.g. tracking memory accesses and waiting for data to arrive.
The registers are organized in banks and cannot be dynamically indexed (the only index is within the instruction word), so switching between register banks is not so complicated. There is a bit of added complexity because threads can have different numbers of registers, 32…256.
Isn’t this the same problem faced by an out-of-order CPU when scheduling operations onto execution units? For each waiting instruction, hardware tracks whether the data it depends on is available, and in the end it comes down to a piece of hardware called a priority encoder to select the appropriate “ready” instruction(s). If I recall correctly, that mechanism is on the critical path in OOO machines and may require some clever circuit design.
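A toy model of that selection step (hypothetical, just to illustrate the principle): keep a scoreboard of per-warp “ready” bits, and let a priority encoder pick the lowest-numbered ready warp each cycle:

```python
def priority_encode(ready_bits):
    """Return the index of the lowest set bit, or None if no warp
    is ready. Models a hardware priority encoder over per-warp
    ready flags; real schedulers also apply fairness policies."""
    for i, ready in enumerate(ready_bits):
        if ready:
            return i
    return None

# Warps 0 and 1 are stalled (e.g. waiting on memory); warps 2 and 3
# are ready, so the encoder selects warp 2.
ready = [False, False, True, True]
print(priority_encode(ready))  # 2
```

In hardware this is a parallel combinational circuit rather than a loop, which is what makes a selection every cycle feasible.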