I found these comments previously, but they are from 2014:
So I am wondering where I can find the corresponding figures for recent NVIDIA GPUs. Thanks.
...
CUDA has multiple different levels of context switching.
Cost to do full GPU context switch is 25-50µs.
Cost to launch CUDA thread block is 100s of cycles.
Cost to launch CUDA warps is < 10 cycles.
Cost to switch between warps allocated to a warp scheduler is 0 cycles and can happen every cycle.
The cost of CDP software pre-emption on CC >= 3.5 is higher and varies with the GPU workload.
Where did you find that data? It looks plausible to me, but I don’t think it is from NVIDIA documentation, so it is hard to assess how reliable it is.
If it’s from a paper by a research group that determined these numbers by microbenchmarking, look for newer papers citing that paper, or for other papers from the same research group.
The basics of NVIDIA GPU operation haven’t changed much since 2014, so to first order the numbers are likely still correct assuming they were correct in 2014.
The following publication might be a good jump-off point to find relevant microbenchmarking studies:
Vasily Volkov, “A microbenchmark to study GPU performance models”. In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2018, pp. 421–422
Hi, I am wondering why “Cost to switch between warps allocated to a warp scheduler is 0 cycles and can happen every cycle.” When we do context switching, the least we must do is set the right PC (or maintain a stack of PCs or something; I don’t know how it is implemented) and switch to the thread’s relevant register set. Does this cost nothing?
Context switching in a GPU does not involve physically moving data in and out of a small register file: courtesy of a register file comprising 65,536 4-byte registers per SM (for a total of 256 KB of register storage per SM), there is no need to share registers between “live” threads.
For example, on a single SM we could have 32 “live” warps of 32 threads each, with each thread using 64 registers. The 32 warps could belong to four thread blocks each comprising eight warps (=256 threads), or some other suitable thread-block arrangement.
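To make the arithmetic in that example concrete, here is a small sketch (plain Python, using only the numbers given above) checking that 32 live warps at 64 registers per thread exactly fill a 64K-register file:

```python
# Register-file budget on one SM, using the figures from the example above.
REGISTERS_PER_SM = 65536   # 64K 4-byte registers per SM
BYTES_PER_REGISTER = 4

warps = 32                 # "live" warps resident on the SM
threads_per_warp = 32
regs_per_thread = 64

regs_used = warps * threads_per_warp * regs_per_thread
print(regs_used)           # 65536 -> exactly fills the register file
print(REGISTERS_PER_SM * BYTES_PER_REGISTER // 1024)  # 256 (KB per SM)
assert regs_used <= REGISTERS_PER_SM
```

Every live thread keeps its registers resident for its whole lifetime, which is why nothing needs to be saved or restored on a warp switch.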
Yes, there is an adjustment of some sort (e.g. selecting a base register pointer/index within the file). Presumably this happens when the next instruction is selected, and it evidently “costs nothing”. We could imagine, for example, that when the warp scheduler selects the next instruction, it also supplies the register-file index that the instruction will use for its register set. This index can be derived from the warp ID itself and delivered to the functional unit processing the instruction.
I’m not saying that’s exactly how it works, but as njuffa states, with all live registers always physically present in the register file, the amount of additional “work” could be small enough to be not significant.
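As a purely hypothetical illustration of that idea (this is not how the hardware is documented to work): if each warp’s registers occupy a contiguous slice of the file and all warps get equal allocations, the base index is just a trivial function of the warp ID, so “switching” register sets amounts to selecting a different constant:

```python
THREADS_PER_WARP = 32

def register_base(warp_id, regs_per_thread):
    """Hypothetical base index of a warp's register slice, assuming
    warps are packed contiguously with equal per-thread allocations."""
    return warp_id * regs_per_thread * THREADS_PER_WARP

# With 64 registers per thread: warp 0 starts at index 0,
# warp 1 at 2048, warp 2 at 4096, ...
print([register_base(w, 64) for w in range(3)])  # [0, 2048, 4096]
```

Computing (or looking up) such a base index is a tiny amount of combinational work, which is consistent with it being hidden inside instruction selection.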
The more astounding feat is the scheduler itself: finding out so quickly which warp to execute next. Some of the logic is moved to compile time (for fixed instruction latencies), but some of it is dynamic, e.g. tracking memory accesses and waiting for data to arrive.
The registers are organized in banks and cannot be dynamically indexed (the only index is within the instruction word), so switching between register banks is not so complicated. There is a bit of added complexity because threads can have different numbers of registers, 32…256.
Isn’t this the same problem faced by an out-of-order CPU when scheduling operations onto execution units? For each waiting instruction, hardware tracks whether the data it depends on is available, and in the end it comes down to a piece of hardware called a priority encoder to select the appropriate “ready” instruction(s). If I recall correctly, that mechanism is on the critical path in OOO machines and may require some clever circuit design.
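A toy model of that selection step (hypothetical, just to illustrate the principle): keep a scoreboard of per-warp “ready” bits, and let a priority encoder pick the lowest-numbered ready warp each cycle:

```python
def priority_encode(ready_bits):
    """Return the index of the lowest set bit, or None if no warp
    is ready. Models a hardware priority encoder over per-warp
    ready flags; real schedulers also apply fairness policies."""
    for i, ready in enumerate(ready_bits):
        if ready:
            return i
    return None

# Warps 0 and 1 are stalled (e.g. waiting on memory); warps 2 and 3
# are ready, so the encoder selects warp 2.
ready = [False, False, True, True]
print(priority_encode(ready))  # 2
```

In hardware this is a parallel combinational circuit rather than a loop, which is what makes a selection every cycle feasible.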