As I understand from the CUDA documentation, when several warps are executed on one multiprocessor together, they are usually “switched” one by one. I mean: while one warp is waiting for a global memory read, instructions from another warp can be executed at the same time. They call it “hiding global memory latency”. A similar thing is hiding “read-after-write dependency latency”.
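To illustrate what I mean, here is a minimal sketch (the kernel and names are just my illustration, not from the documentation): each warp issues a long-latency global load and then some independent arithmetic, so the scheduler can run other warps while this one waits.

```cuda
// Illustrative kernel: each thread loads from global memory, then does
// arithmetic that does NOT depend on the loaded value. While one warp
// stalls on its load, the multiprocessor can issue instructions from
// other resident warps, hiding the memory latency.
__global__ void latency_hiding_demo(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float loaded = in[i];          // long-latency global read: this warp waits here...
    float independent = i * 0.5f;  // ...independent work; other warps can run meanwhile

    // read-after-write dependency: this instruction must wait for 'loaded'
    out[i] = loaded + independent;
}
```

My understanding is that with enough warps resident on a multiprocessor, these stalls overlap and the hardware stays busy.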
I hope the previous statements are right. Please correct me if I am mistaken.
My question is: does anybody know more details about this mechanism?
What does the time spent on switching depend on?
Does it matter whether the warps are from one block or from different blocks?
Something else? I would appreciate any related information.
Thanks in advance.