CUDA context switching overhead of current GPUs


I found the following comments previously, but they are from 2014:
So I wonder where I can find these figures for recent NVIDIA GPUs. Thanks.

CUDA has multiple different levels of context switching.
Cost to do a full GPU context switch is 25-50 µs.
Cost to launch a CUDA thread block is 100s of cycles.
Cost to launch CUDA warps is < 10 cycles.
Cost to switch between warps allocated to a warp scheduler is 0 cycles and can happen every cycle.
The cost of CDP (CUDA Dynamic Parallelism) software pre-emption on CC >= 3.5 is higher and varies with the GPU workload.
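Figures like the ones above are typically obtained by microbenchmarking. A minimal sketch of how one might measure kernel launch overhead on one's own GPU is shown below; this is an illustrative harness, not the methodology behind the quoted numbers, and the measured value will vary with GPU, driver, and operating system.

```cuda
// Hypothetical microbenchmark sketch: estimate the average launch overhead
// of an empty kernel by timing many back-to-back launches with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}  // does no work; only launch cost remains

int main() {
    const int iterations = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch: the very first launch includes one-time context and
    // module initialization cost, which we want to exclude.
    emptyKernel<<<1, 32>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < iterations; ++i) {
        emptyKernel<<<1, 32>>>();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average per-launch time: %.3f us\n", ms * 1000.0f / iterations);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Note that this measures launch-to-launch throughput on a single stream, not a full GPU context switch between processes; measuring the latter requires two separate processes contending for the same GPU.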

Where did you find that data? It looks plausible to me, but I don't think it is from NVIDIA's documentation, so it is hard to assess how reliable it is.

If it's from a paper by a research group that determined these numbers by microbenchmarking, look for newer papers citing that paper, or for other papers from the same research group.

The basics of NVIDIA GPU operation haven't changed much since 2014, so to first order the numbers are likely still correct, assuming they were correct in 2014.

The following publication might be a good jumping-off point for finding relevant microbenchmarking studies:

Vasily Volkov, "A microbenchmark to study GPU performance models." In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18), February 2018, pp. 421-422