What is a GDU? Do you mean GPU?
Memory loads and stores are just like other instructions. They are executed one warp at a time. So you can’t have 8000 simultaneous loads.
If a load for a warp is coalesced, a single memory transaction is issued. If not, then multiple memory transactions are issued.
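To make the coalescing point concrete, here is a rough host-side C++ sketch that counts how many transactions one warp-wide load would need. It is a toy model, not the hardware's actual rule: it assumes a transaction covers one aligned 128-byte segment and that each of the 32 threads reads a 4-byte word. The names count_transactions and warp_addresses are made up for illustration, and the real rules depend on the GPU generation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Toy model: one transaction per distinct aligned segment touched by the warp.
// The 128-byte segment size is an assumption for illustration only.
std::size_t count_transactions(const std::vector<std::uint64_t>& byte_addresses,
                               std::uint64_t segment_bytes = 128) {
    assert(segment_bytes > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t addr : byte_addresses) {
        segments.insert(addr / segment_bytes);  // index of the segment this thread hits
    }
    return segments.size();
}

// Addresses touched by a 32-thread warp when thread `lane` reads the 4-byte
// word at base + lane * stride_words * 4.
std::vector<std::uint64_t> warp_addresses(std::uint64_t base,
                                          std::uint64_t stride_words) {
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t lane = 0; lane < 32; ++lane) {
        addrs.push_back(base + lane * stride_words * 4);
    }
    return addrs;
}
```

Under these assumptions, consecutive words (stride 1) from an aligned base fall into one segment, while a stride of 32 words puts every thread in its own segment, so the same load costs 32 transactions.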
Once a warp issues an instruction, usually control passes to another warp and it executes an instruction. In other words, processing proceeds across warps before proceeding down the instructions of a single warp. In this way pipeline and memory latency can be hidden. Once control returns to a warp, if it is not blocked waiting on data from a load or a barrier, then it issues its next instruction.
Memory instructions are no different. Warps proceed sequentially on the multiprocessor. After one warp issues its load, the next proceeds. When a warp is no longer blocked waiting for data, control will eventually return to it and it will issue another instruction.
So take this code:
0 MOV R0, 0x2
1 MOV R1, 0x4
2 LOAD R2, MEM
3 MUL R0, R0, R1
4 MAD R0, R1, R2, R0
If you have, say, 4 warps, then you’ll get:
0: WARP0(MOV), WARP1(MOV), WARP2(MOV), WARP3(MOV)
1: WARP0(MOV), WARP1(MOV), WARP2(MOV), WARP3(MOV)
2: WARP0(LOAD), WARP1(LOAD), WARP2(LOAD), WARP3(LOAD)
3: WARP0(MUL), WARP1(MUL), WARP2(MUL), WARP3(MUL)
STALL until first LOAD returns
4: WARP0(MAD), WARP1(MAD), WARP2(MAD), WARP3(MAD)
See, since the MUL is not dependent on the LOAD, the MUL can proceed without waiting on it. But once the MUL has been issued for all warps, if none of the LOADs has its data yet, then the warps all stall until one of them gets its data.
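The schedule above can be reproduced with a small host-side C++ sketch. This is only a simplified model of the issue order described here, not how a real scheduler is implemented: it assumes strict round-robin issue across warps and assumes that no LOAD has returned by the time the first dependent instruction is reached. The Instr struct and the functions issue_order and example_program are hypothetical names for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Instr {
    std::string op;
    bool is_load;     // starts a long-latency memory access
    bool needs_load;  // reads the register written by the load
};

// Returns one row per issue step: every warp issues instruction `pc`
// before any warp moves on to instruction `pc + 1`.
std::vector<std::string> issue_order(const std::vector<Instr>& program,
                                     int num_warps) {
    assert(num_warps > 0);
    std::vector<std::string> rows;
    bool load_pending = false;
    for (std::size_t pc = 0; pc < program.size(); ++pc) {
        const Instr& in = program[pc];
        // Assumption: the data has not arrived yet, so the first dependent
        // instruction forces a stall.
        if (in.needs_load && load_pending) {
            rows.push_back("STALL until first LOAD returns");
            load_pending = false;
        }
        std::string row = std::to_string(pc) + ":";
        for (int w = 0; w < num_warps; ++w) {
            row += (w == 0 ? " " : ", ");
            row += "WARP" + std::to_string(w) + "(" + in.op + ")";
        }
        rows.push_back(row);
        if (in.is_load) load_pending = true;
    }
    return rows;
}

// The five-instruction listing from this post: only the MAD reads R2.
std::vector<Instr> example_program() {
    return {
        {"MOV", false, false},
        {"MOV", false, false},
        {"LOAD", true, false},
        {"MUL", false, false},
        {"MAD", false, true},
    };
}
```

Calling issue_order(example_program(), 4) gives the six rows shown above, with the stall placed between the MUL row and the MAD row. Marking the MUL as needs_load moves the stall up, right after the LOAD row, which is the case where there is no independent work left to hide the latency.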
Note that during that stall, there may be warps from other thread blocks active on the multiprocessor that can proceed with their work. This is important when the code contains a barrier:
BAR // __syncthreads();
Because now, even if the MUL is independent of the LOAD, all threads in the block have to wait for the barrier (__syncthreads()). This means that you can only hide the memory latency if you have more than one active thread block per multiprocessor.