Measure warp context switching time

As we know, when a kernel stalls on a global memory access or a read-after-write dependency, the scheduler switches execution to another warp to hide the latency and maximize throughput.
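One caveat: on NVIDIA GPUs all resident warps keep their registers and state on-chip, so the warp "switch" itself is performed by the hardware scheduler with essentially no overhead; what you can actually measure is the memory latency that gets exposed when there are too few warps to switch to. Below is a minimal sketch of the classic `clock64()` microbenchmark approach under that assumption: the kernel name, the chain of 4 dependent loads, and the launch configurations are all illustrative choices, not a definitive methodology.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Time a chain of dependent global loads with clock64(). Each load's address
// depends on the previous result, so the warp must stall on every load.
// Comparing a 1-warp launch against a many-warp launch shows how much of
// that latency the scheduler can hide by switching among ready warps.
__global__ void latency_probe(const int* __restrict__ data,
                              long long* cycles, int* sink)
{
    int v = 0;
    long long start = clock64();
    v = data[v];  // dependent load chain: data is zero-filled,
    v = data[v];  // so every load reads index 0 and the chain
    v = data[v];  // cannot be reordered or issued in parallel
    v = data[v];
    long long stop = clock64();

    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *cycles = stop - start;
    }
    *sink = v;  // keep the compiler from optimizing the loads away
}

int main()
{
    const int n = 1024;
    int *d_data, *d_sink;
    long long *d_cycles;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_sink, sizeof(int));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemset(d_data, 0, n * sizeof(int));  // data[0] == 0 keeps the chain at index 0

    // One warp: the stalls are fully exposed.
    latency_probe<<<1, 32>>>(d_data, d_cycles, d_sink);
    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("1 warp  : %lld cycles for 4 dependent loads\n", cycles);

    // 32 warps: the scheduler has other warps to run while each one waits.
    latency_probe<<<1, 1024>>>(d_data, d_cycles, d_sink);
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("32 warps: %lld cycles for 4 dependent loads\n", cycles);

    cudaFree(d_data); cudaFree(d_sink); cudaFree(d_cycles);
    return 0;
}
```

Note that `clock64()` reads a per-SM counter, so per-warp timings are only comparable within one SM; for whole-kernel throughput comparisons you would use CUDA events or a profiler instead.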

I was wondering if anyone knows how to measure the time of warp switching.

Any suggestions or comments are welcome.

It seems like you’ve already asked this, and Greg Smith gave you some pretty useful comments:

cuda - Measure the overhead of context switching in GPU - Stack Overflow