Question about coalesced global load/store

Does a coalesced global load or store take advantage of DRAM's burst mode?

Which burst mode?

Coalesced access gives the best performance you can get, that is, as close to peak memory bandwidth as possible.

GDDR3 requires a burst of at least 4. Burst mode hasn't been optional since SDR. (Incidentally, GDDR3 is not related to DDR3, which requires a burst of 8. GDDR4 and GDDR5 also use 8.) The channels on GeForces are 64 bits wide, so one burst is 8 bytes × 4 = 32 bytes. A coalesced read from a half-warp (16 threads reading 4 bytes each) requests 64 bytes, which is two full bursts.
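To make the arithmetic above concrete, here is a small host-side C++ sketch. The function names are made up for illustration, and the 16 threads × 4 bytes figure assumes each thread of the half-warp reads one 32-bit word.

```cpp
#include <cstddef>

// Bytes delivered by one DRAM burst: channel width in bytes times burst length.
constexpr std::size_t burst_bytes(std::size_t channel_bits, std::size_t burst_length) {
    return (channel_bits / 8) * burst_length;
}

// Bytes requested when every thread in a group reads one element.
constexpr std::size_t request_bytes(std::size_t threads, std::size_t bytes_per_thread) {
    return threads * bytes_per_thread;
}

// Bursts needed to cover a contiguous, burst-aligned request (rounded up).
constexpr std::size_t bursts_needed(std::size_t request, std::size_t burst) {
    return (request + burst - 1) / burst;
}

// Figures from the post: 64-bit channel, burst of 4, half-warp of 16 threads.
// The 4 bytes per thread is an assumption (one 32-bit word each).
static_assert(burst_bytes(64, 4) == 32, "64-bit channel, burst 4 -> 32 bytes");
static_assert(request_bytes(16, 4) == 64, "half-warp of 32-bit words -> 64 bytes");
static_assert(bursts_needed(64, 32) == 2, "one coalesced half-warp read = 2 bursts");
```

So a coalesced half-warp read fills whole bursts with data that is actually used, which is the sense in which it takes advantage of burst mode.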

So, yes.

Additional info: channels are interleaved with a block size of 256 bytes (remember the total stride may be odd, since the number of channels is often odd). Don't hit the same channel from all threads and blocks; at the same time, requesting all 256 bytes sequentially from one block at once is also good (provided that block owns the whole multiprocessor).
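A rough way to picture the interleaving is the C++ sketch below. It assumes a plain round-robin mapping (256-byte block index modulo the number of channels); the real hardware mapping is not documented here and may differ. The names `channel_of` and `distinct_channels` and the channel counts 7 and 8 are hypothetical, chosen only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Interleave granularity from the post: 256 bytes per channel block.
constexpr std::size_t kInterleaveBytes = 256;

// ASSUMPTION: simple round-robin mapping of 256-byte blocks to channels.
// Requires num_channels > 0.
constexpr std::size_t channel_of(std::uint64_t byte_address, std::size_t num_channels) {
    return static_cast<std::size_t>((byte_address / kInterleaveBytes) % num_channels);
}

// Count how many distinct channels are hit by `count` accesses that start at
// `base` and are spaced `stride_bytes` apart.
std::size_t distinct_channels(std::uint64_t base, std::uint64_t stride_bytes,
                              std::size_t count, std::size_t num_channels) {
    std::vector<bool> seen(num_channels, false);
    std::size_t distinct = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t c = channel_of(base + i * stride_bytes, num_channels);
        if (!seen[c]) {
            seen[c] = true;
            ++distinct;
        }
    }
    return distinct;
}
```

Under this model, with a hypothetical 7 channels, consecutive 256-byte blocks rotate through all 7, and a power-of-two stride such as 2048 bytes still spreads across them, whereas a stride of 7 × 256 bytes lands on the same channel every time. With 8 channels, the 2048-byte stride would hit a single channel, which is why the total stride matters.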

Of course, alignment matters.
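As a minimal illustration of why alignment matters, the C++ sketch below counts how many 32-byte bursts a 64-byte request overlaps for a given starting address. It only models burst boundaries; the hardware's actual coalescing rules can be stricter than this on some GPU generations, and `bursts_touched` is a made-up helper name.

```cpp
#include <cstddef>
#include <cstdint>

// Number of bursts overlapped by a contiguous request of `request_bytes`
// starting at byte address `base`, for bursts of `burst_bytes` each.
constexpr std::size_t bursts_touched(std::uint64_t base, std::size_t request_bytes,
                                     std::size_t burst_bytes) {
    return request_bytes == 0
               ? 0
               : static_cast<std::size_t>((base + request_bytes - 1) / burst_bytes
                                          - base / burst_bytes + 1);
}

// A 64-byte half-warp request aligned to a burst boundary needs 2 bursts.
static_assert(bursts_touched(0, 64, 32) == 2, "aligned request");
static_assert(bursts_touched(32, 64, 32) == 2, "aligned to the next burst");
// Shifted by one 4-byte word, the same request spills into a third burst.
static_assert(bursts_touched(4, 64, 32) == 3, "misaligned request");
```

In this model the misaligned request transfers 96 bytes to deliver 64 useful ones, so a third of the bandwidth is wasted.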