Why half-warp coalesced memory reads?

Hi,

I’m trying to figure out why memory fetches are done per half-warp (16 threads) rather than per warp. My theory is this: the instruction unit runs at a different clock speed than the SPs and SFUs and issues two instructions to these units (taking 4 clock cycles in total). During those 4 clock cycles the 8 SPs have time to perform 4*8 = 32 operations, which gives a warp of 32 threads. I therefore believe that the instruction unit only needs to tell the 8 SPs to perform a memory fetch (giving the SPs 2 clock cycles), which would mean 8 threads/cycle * 2 cycles = 16 threads, or a half-warp, doing the coalesced memory fetch. Can someone verify this theory, or explain how this actually works and point me to a proper reference?
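To make the arithmetic in this theory concrete, here is a minimal Python sketch. The constants (8 SPs, 4 cycles per warp instruction, 2 cycles per memory fetch) are the assumptions of the theory itself, not confirmed hardware facts, and the function name is made up for illustration.

```python
# Arithmetic behind the theory above. Every constant here is an assumption
# taken from the post, not a verified hardware figure.
SP_COUNT = 8                     # scalar processors per multiprocessor
CYCLES_PER_WARP_INSTRUCTION = 4  # SP clock cycles assumed per issued instruction
CYCLES_PER_MEMORY_FETCH = 2      # SP clock cycles assumed for a memory fetch


def threads_served(sp_count: int, cycles: int) -> int:
    """Threads handled if each SP processes one thread per clock cycle."""
    return sp_count * cycles


warp_size = threads_served(SP_COUNT, CYCLES_PER_WARP_INSTRUCTION)
half_warp = threads_served(SP_COUNT, CYCLES_PER_MEMORY_FETCH)

print("threads per warp instruction:", warp_size)
print("threads per memory fetch:", half_warp)
```

The sketch only shows that the numbers are consistent with a 32-thread warp and a 16-thread half-warp; it does not show that this is the actual reason for either size.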

Thanks, Ian.

16 smem banks…

(32/2)=(16)=(half-warp)
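As a rough illustration of how 16 banks line up with a half-warp, here is a small Python model. It assumes successive 32-bit words map to successive banks (as on the 16-bank hardware discussed here) and ignores the broadcast case where all threads read the same word; the helper names are hypothetical.

```python
# Simplified model of 16-bank shared memory: successive 32-bit words are
# assumed to live in successive banks. Broadcast reads are NOT modeled.
NUM_BANKS = 16
WORD_BYTES = 4
HALF_WARP = 16


def bank_of(byte_address: int) -> int:
    """Bank index that holds the 32-bit word at this byte address."""
    return (byte_address // WORD_BYTES) % NUM_BANKS


def conflict_degree(stride_words: int) -> int:
    """Largest number of half-warp threads that land in the same bank
    when thread t reads the word at index t * stride_words."""
    hits = [0] * NUM_BANKS
    for t in range(HALF_WARP):
        hits[bank_of(t * stride_words * WORD_BYTES)] += 1
    return max(hits)


for stride in (1, 2, 3, 16):
    print("stride", stride, "-> conflict degree", conflict_degree(stride))
```

With a stride of one word, the 16 threads of a half-warp touch 16 different banks, so in this model a half-warp is exactly the number of threads that can be served without a conflict.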

Hi!

Thanks for your input! Are you saying that the reason global memory accesses are coalesced per half-warp, and not, say, per quarter-warp, is simply that NVIDIA simplified the GPU design, since smem reads are already at most 16 threads/banks wide?

Ian

What I said does not explain global memory coalescing – obviously…

But my hypothesis is that “half-warps” exist because of shared memory… And since half-warps exist, global memory coalescing also follows a “half-warp” based rule…
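For reference, a half-warp based coalescing rule can be sketched as below. This is a simplified model of the rule for the earliest CUDA devices, assuming every thread of the half-warp takes part: thread k must read the k-th word of a segment whose start is aligned to 16 times the word size. The function name is made up, and the real rules have more cases.

```python
# Simplified model of a half-warp coalescing rule. Assumption: all 16 threads
# participate, thread k reads word k, and the first address is aligned to
# 16 * word size. Real hardware rules cover more cases than this.
HALF_WARP = 16


def is_coalesced(addresses, word_bytes=4):
    """True if the 16 byte addresses form one aligned, in-order segment."""
    if len(addresses) != HALF_WARP:
        raise ValueError("expected one address per half-warp thread")
    base = addresses[0]
    if base % (HALF_WARP * word_bytes) != 0:
        return False  # segment start is not aligned
    return all(a == base + k * word_bytes for k, a in enumerate(addresses))


aligned = [256 + 4 * k for k in range(HALF_WARP)]
misaligned = [260 + 4 * k for k in range(HALF_WARP)]
strided = [256 + 8 * k for k in range(HALF_WARP)]

print("aligned:", is_coalesced(aligned))
print("misaligned:", is_coalesced(misaligned))
print("strided:", is_coalesced(strided))
```

Under this model one coalesced half-warp read of 4-byte words covers a single 64-byte segment, which is the global memory counterpart of the 16-bank picture for shared memory.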

But yeah… everything is a hypothesis… Someone else might have a better explanation.

Yes, well, I’m pretty sure that the memories are adapted to the inner workings of the microarchitecture. It seems likely that there are 32 threads in a warp because of the different speeds of the scheduler and the fact that it issues instructions to both the SPs and the SFUs (2 instructions and 2 cc at IS speed), taking 4 clock cycles in total, thus 8 threads/cc * 4 cc = 32, just as Ian suggested before. I’ve attached an image to make this clearer.

Now the reason for 16 banks in shared memory, and the reason why global memory is read per half-warp, becomes clearer. The fetch only needs to be done by the SPs, which have 2 cc to perform it, yielding 16 threads (see the picture again). This is the theory anyway; I haven’t been able to confirm it either.

But all in all, it seems much more likely that the memories have been adapted to the SPs etc. than the other way around.

Regards,

jimmy
[Attached image: why32.png]