The guide isn’t very clear about what happens during a load instruction.
I can think of several possibilities:
1. The thread stalls and the entire warp stalls with it. The scheduler tries to find another warp to execute, that warp stalls as well, and so on until all warps are stalled and there are no warp resources left.
2. The thread continues executing other instructions that do not depend on the load. When it reaches an instruction that does depend on the load, it stalls, and everything else stalls as in 1.
3. The thread stalls and is swapped with another thread from the block, but the warp continues. (This doesn’t seem to be the case.)
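
One way to tell the first two possibilities apart might be a small timing probe. The sketch below is only that, a sketch: `load_probe`, `run_probe` and `n_indep` are names I made up for illustration, and it assumes that `clock64()` readings around the code are a usable proxy for stall time. A single thread issues one global load, runs `n_indep` integer steps that do not touch the loaded value, and only then uses it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical probe kernel (not from the guide): one thread issues a global
// load, does n_indep steps of work that do not use the loaded value, then
// consumes the loaded value and reports the elapsed cycle count.
__global__ void load_probe(const unsigned* in, unsigned* out,
                           long long* cycles, int n_indep)
{
    long long t0 = clock64();
    unsigned loaded = in[0];               // long-latency global load
    unsigned acc = 1u;
    for (int i = 0; i < n_indep; ++i)      // independent of `loaded`
        acc = acc * 1664525u + 1013904223u;
    unsigned result = loaded ^ acc;        // first use of the loaded value
    out[0] = result;
    long long t1 = clock64();
    *cycles = t1 - t0;
}

// Runs the probe once with a single thread. Returns the cycle count reported
// by that thread, or -1 on a CUDA error or a wrong result.
long long run_probe(int n_indep)
{
    unsigned h_in = 42u, h_out = 0u;
    long long h_cycles = -1;
    unsigned *d_in = nullptr, *d_out = nullptr;
    long long* d_cycles = nullptr;

    bool ok = cudaMalloc(&d_in, sizeof(unsigned)) == cudaSuccess
           && cudaMalloc(&d_out, sizeof(unsigned)) == cudaSuccess
           && cudaMalloc(&d_cycles, sizeof(long long)) == cudaSuccess
           && cudaMemcpy(d_in, &h_in, sizeof(unsigned),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        load_probe<<<1, 1>>>(d_in, d_out, d_cycles, n_indep);
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(&h_out, d_out, sizeof(unsigned),
                        cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(&h_cycles, d_cycles, sizeof(long long),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);       // cudaFree(nullptr) is a harmless no-op
    cudaFree(d_out);
    cudaFree(d_cycles);

    // Host reference for the same integer recurrence (exact in unsigned math).
    unsigned acc = 1u;
    for (int i = 0; i < n_indep; ++i)
        acc = acc * 1664525u + 1013904223u;

    if (!ok || h_out != (h_in ^ acc)) return -1;
    return h_cycles;
}

int main()
{
    run_probe(0);  // warm-up launch, result ignored
    printf("n_indep =  0: %lld cycles\n", run_probe(0));
    printf("n_indep = 64: %lld cycles\n", run_probe(64));
    return 0;
}
```

If the two cycle counts come out close to each other, that would hint that the independent work overlaps with the load (the second possibility). If the second count is larger by roughly the cost of the loop, that would point toward the first. I would treat either reading with caution: the compiler is free to reorder the load, the loop and the clock reads, so it is worth checking the generated code with `cuobjdump -sass`, and single measurements like this are noisy.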
I am starting to suspect it’s case 1. That would mean it’s impossible to hide the latency within a single thread by executing other, independent instructions in that thread while the load is in flight?!
So the claim of “latency hiding” seems exaggerated.
It seems only other warps can be run, but those also stall quickly, and then everything is stalled?!
The guide should be clearer on this.