I’m currently trying to figure out why memory fetches are done per half-warp (16 threads) rather than per full warp (32 threads). My theory is this: the instruction unit runs at a different clock speed than the SPs and SFUs and issues two instructions to these units, taking 4 clock cycles in total. During those 4 clock cycles the 8 SPs have time to perform 4 * 8 = 32 operations, which gives a warp of 32 threads. I therefore believe that for a memory fetch the instruction unit only needs to give the 8 SPs 2 clock cycles, which would mean 8 threads/cycle * 2 cycles = 16 threads, i.e. a half-warp, doing the coalesced memory fetch. Can someone verify this theory, or explain how this actually works, and point me to a proper reference?
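
To make the arithmetic in my theory concrete, here is a minimal Python sketch. The 8 SPs, the 4 cycles per instruction, and the 2 cycles per fetch are the assumptions of the theory above, not confirmed hardware facts, and `threads_served` is just an illustrative helper name.

```python
# Assumed figures taken from the theory above (not confirmed hardware facts).
SPS_PER_MULTIPROCESSOR = 8   # scalar processors assumed to work in lockstep
CYCLES_PER_INSTRUCTION = 4   # SP clock cycles assumed per issued instruction
CYCLES_PER_FETCH = 2         # SP clock cycles assumed per memory fetch


def threads_served(sps: int, cycles: int) -> int:
    """Threads processed if each SP handles one thread per clock cycle."""
    return sps * cycles


warp = threads_served(SPS_PER_MULTIPROCESSOR, CYCLES_PER_INSTRUCTION)
half_warp = threads_served(SPS_PER_MULTIPROCESSOR, CYCLES_PER_FETCH)

print(f"threads per instruction: {warp}")
print(f"threads per memory fetch: {half_warp}")
```

If the assumptions hold, the two counts come out as a full warp and a half-warp respectively; whether the hardware actually schedules fetches this way is exactly what I would like confirmed.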