Coalesced Transaction Size


I was wondering how the possible size of one memory transaction (e.g. 256 bytes on CC 2.x) relates to the device's memory interface width, e.g. 384 bits on the GTX 580.

How is the maximum transaction size determined? Is there a relationship to the interface width, or is it something the driver decides?

Also, I can't reconcile my observations with the figures and facts about coalesced memory access in section 3.2.1 of the CUDA Best Practices Guide v4.0:

Offset Copy: the figure shows a heavy impact on the GTX 280. In fact this device has no problem with an offset copy, since it results in only one additional coalesced access (Programming Guide v4.0).

Strided Copy: the figure shows an immediate impact at a stride of 2. I measured no impact at all with stride 2, a slight impact with stride 4, and a strongly growing impact with strides > 4.

What shall I believe? :D

In your case the CC 2.x cache probably has a large effect.

In the last several GPU generations, the memory interface has always consisted of multiple 64-bit-wide channels. Wider transactions (up to 128 bytes on Fermi GPUs) are carried out as bursts within a single channel and are therefore more efficient.
The mapping of addresses to memory channels is hashed on Fermi class GPUs to prevent partition camping.
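To make the relationship between interface width and transaction size concrete, here is a small arithmetic sketch. It only uses the numbers mentioned in this thread (64-bit channels, a 384-bit interface, 128-byte and 32-byte transactions); the function names are made up for illustration, and "transfers" here simply counts 64-bit data beats, without modelling real DRAM burst lengths or clocking.

```python
def channel_count(interface_bits, channel_bits=64):
    """Number of independent channels in an interface of the given width."""
    return interface_bits // channel_bits


def transfers_per_transaction(transaction_bytes, channel_bits=64):
    """Number of channel-wide data transfers needed to move one transaction."""
    return transaction_bytes * 8 // channel_bits


if __name__ == "__main__":
    # 384-bit interface split into 64-bit channels
    print("channels:", channel_count(384))
    # One 128-byte transaction travels over a single 64-bit channel
    print("transfers for 128 B:", transfers_per_transaction(128))
    print("transfers for 32 B:", transfers_per_transaction(32))
```

So the interface width determines how many transactions can be in flight in parallel on different channels, while a single transaction stays within one channel.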

As L F wrote, the cache in Fermi GPUs affects this a lot. Memory transactions on Fermi always have the width of a full cache line, which is 128 bytes, or 32 bytes if the L1 cache was disabled with -Xptxas -dlcm=cg at compile time.
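A rough way to see what this means for offset and strided copies is to count how many distinct lines one warp touches. The following is a simplified model, not a measurement: it assumes 32 threads per warp, 4-byte elements, an array aligned to the line size, and thread t accessing element offset + t * stride. The helper name lines_touched is hypothetical, and the model ignores cache hits from neighbouring warps, which can hide much of the extra traffic in practice.

```python
WARP_SIZE = 32  # threads per warp


def lines_touched(offset, stride, line_bytes=128, word_bytes=4,
                  warp_size=WARP_SIZE):
    """Count distinct lines of `line_bytes` accessed by one warp.

    Thread t accesses the element at index offset + t * stride,
    where each element is `word_bytes` wide.
    """
    addresses = [(offset + t * stride) * word_bytes for t in range(warp_size)]
    return len({addr // line_bytes for addr in addresses})


if __name__ == "__main__":
    for line_bytes in (128, 32):  # L1 enabled vs. L1 bypassed (assumed sizes)
        print(f"line size {line_bytes} bytes")
        for offset, stride in [(0, 1), (1, 1), (0, 2), (0, 4), (0, 32)]:
            n = lines_touched(offset, stride, line_bytes)
            print(f"  offset={offset:2d} stride={stride:2d} -> {n} line(s)")
```

Under these assumptions, an offset of one element costs one extra 128-byte line per warp, and a stride of 2 doubles the number of lines. Whether that shows up as lost bandwidth depends on how many of those lines are already cached.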