Memory transaction size

I’m currently optimizing/restructuring some code which needs to access sparse, random, single words from large amounts of device memory… just about the worst case memory access pattern for CUDA! Unfortunately my code is memory bandwidth bottlenecked. This has led to some questions, not about my code, but just about CUDA and GPU memory access in general.

We know that current CUDA GPUs issue memory transactions per half-warp. Compute capability 1.2 coalescing is much more versatile than on earlier devices, in that you don’t need to keep queried addresses in order as long as all the threads in the half-warp read from the same segment.

There’s a lot of detail in the programming guide in Chapter 5. This documents that on compute capability 1.2+ devices (GT200), you can use a transaction size as small as 32 bytes as long as each thread accesses memory only by 8-bit words. If you access by 16-bit words, your transaction size is 64 bytes. If you access by 32-bit words, your transaction size is 128 bytes.

These are all assuming the addresses you query are within the same memory segment.

But these query-size dependencies bother me. I’m not sure why the size of the word you access should make any difference. From the docs, it sounds like you’d get three different behaviors as follows:

__global__ void kernel(char *deviceArray)
{
	char  c = ((char  *)deviceArray)[0];   // reads via a 32 byte transaction
	short s = ((short *)deviceArray)[0];   // reads via a 64 byte transaction
	int   i = ((int   *)deviceArray)[0];   // reads via a 128 byte transaction
}

If I run a speed test on a GTX 280 with actual code, I don’t see any speed difference between the three cases.
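For reference, my speed test was roughly along these lines. This is a minimal sketch rather than my exact code; the kernel, the sizes, and the multiplicative-hash scatter index are all just illustrative:

#include <cstdio>
#include <cuda_runtime.h>

// Each thread reads one scattered word of type T and writes it out.
// Swapping T between char, short, and int tests the three access widths.
template <typename T>
__global__ void scatterRead(const T *in, T *out, unsigned n)
{
	unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		out[i] = in[(i * 2654435761u) % n];   // pseudo-random single-word read
}

int main()
{
	const unsigned n = 1 << 22;
	int *in, *out;
	cudaMalloc((void **)&in,  n * sizeof(int));   // contents don't matter for timing
	cudaMalloc((void **)&out, n * sizeof(int));

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start);
	scatterRead<int><<<n / 256, 256>>>(in, out, n);   // also run with <short> and <char> (cast the pointers accordingly)
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);

	float ms;
	cudaEventElapsedTime(&ms, start, stop);
	printf("32-bit scattered reads: %.3f ms\n", ms);
	return 0;
}

The reads are scattered but the writes are coalesced, so any timing difference between the three word sizes should come almost entirely from the read transactions.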

Now the programming guide goes on to give “more precise” details, which say that if the queried addresses all fall within half of the segment, 64 byte transactions may be reduced to 32 bytes, and 128 byte transactions to 64 bytes.

However, my speed tests show that the 128 byte read of ints is indeed the same speed as the chars, so it sounds like the 128 byte transaction is being reduced all the way down to 32.

Again, this makes a lot of sense, but it’s clearly not what the programming guide says either in its general description or its “more precise” followup.

So it may just be the language in the programming guide… admittedly it’s trying to explain a set of hardware access rules using English and that’s always a challenge.

So a few questions:

  1. Is the programming guide wrong, and is it possible to use 32 byte memory transactions even when reading/writing 32-bit ints (as long as they all fall in the same 32 byte segment)? I think this is true.

  2. Out of curiosity, what part of the hardware limits the transaction size? (I.e., why doesn’t it allow even tiny 4 byte transactions?) Is it based on the fact that each memory chip holds one slice of memory, so a transaction multiplexes the query to each RAM chip and pulls, say, 16 bits from each? Or is it some limitation of the GPU memory controller, simplifying its transistor usage by dealing with fewer memory chunks of larger size? Or some other optimization?

  3. CPUs also access memory in transaction chunks. I think modern Intel Core CPUs use 64 byte transactions as well, since that’s their cache line size. Are those transaction sizes limited the same way as on GPUs, where it’s some multiplexed RAM chip query? Or what’s the tradeoff?

In my G300 wishlist, I’d like to put “allow smaller device memory transaction sizes”. G300 is taped out already, so please move my wish to G400’s wishlist if necessary. :-)

That’s a lot of questions! I’ll just try to answer two…

On G80 to GT200 architectures, 32-bit GDDR3 RAM chips are grouped in pairs, and every transaction is routed to a pair of chips operating in parallel. This is why the bus widths of NVidia GPUs are all multiples of 64 bits.

Then, each GDDR3 chip only accepts burst accesses of 4 consecutive words (see http://en.wikipedia.org/wiki/GDDR3). This is what allows them to reach high frequencies. GDDR5 doubles the burst length again, so the trend is toward still larger transactions.

As a consequence, the minimal transaction size for the DRAM is 32 bytes: two 32-bit chips × a burst of 4 words = 2 × 4 bytes × 4 = 32 bytes. (This doesn’t imply that it is the minimal transaction size for the on-chip interconnect, which may be smaller, nor the minimal size for the memory controller, which may be larger…)

BTW, the size of a texture cache line is 256 bytes, so memory controllers are likely optimized for this granularity. (This is reflected in benchmarks accessing coalesced memory.)

As for CPUs: all DRAM technologies are based on the same principles, so yes. Each memory channel is 64 bits wide, as on NVidia GPUs, and the typical burst length is 4, so the minimal transaction size is also 32 bytes.

All three cases you list would have the same behavior.

The transaction size isn’t fixed. It works like this:

  1. each thread in a half warp issues a load or a store
  2. the memory controller looks at the address of each load and determines which aligned section it falls in
  3. if the addresses are not all in the same aligned section, split the load into however many aligned sections are used and proceed.
  4. for each aligned section, start with a 128 byte transaction.
  5. if all addresses fall within 64 bytes, replace the 128 byte transaction with a 64 byte transaction.
  6. if all addresses fall within 32 bytes, replace the 64 byte transaction with a 32 byte transaction.

Therefore, if you’ve got a half warp reading the same memory location (and one that is aligned on a four byte boundary), you get a 32 byte load.
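Here’s the same logic as a quick host-side sketch, purely illustrative and with the same alignment hand-waving (addresses assumed word-aligned; the function and names are just made up for the example):

#include <stdio.h>

// Illustrative host-side version of the rules above: take the 16 addresses
// issued by a half warp and print the transactions they would generate.
void coalesce(const unsigned long addr[16])
{
	int done[16] = {0};
	for (int i = 0; i < 16; i++) {
		if (done[i]) continue;
		// Steps 2-4: group every address sharing this 128 byte aligned section.
		unsigned long base = addr[i] & ~127UL;
		unsigned long lo = addr[i], hi = addr[i];
		for (int j = i; j < 16; j++) {
			if (!done[j] && (addr[j] & ~127UL) == base) {
				done[j] = 1;
				if (addr[j] < lo) lo = addr[j];
				if (addr[j] > hi) hi = addr[j];
			}
		}
		// Steps 5-6: shrink 128 -> 64 -> 32 when all addresses fit in one half.
		unsigned long size = 128;
		if ((lo & ~63UL) == (hi & ~63UL)) size = 64;
		if ((lo & ~31UL) == (hi & ~31UL)) size = 32;
		printf("%lu byte transaction at %#lx\n", size, lo & ~(size - 1));
	}
}

int main()
{
	unsigned long same[16];
	for (int i = 0; i < 16; i++) same[i] = 0x1000;   // whole half warp reads one word
	coalesce(same);   // prints a single 32 byte transaction, as described above
	return 0;
}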

This is all coming out of my head (it’s how things work; I’ve hand-waved away a lot of behavior regarding alignment, but in general this is correct), so maybe the programming guide is wrong or unclear. I can check.

That’s useful to know, and it makes sense that there’s a tradeoff, especially since the query address needs to get sent to the RAM chip to begin with, so it’s obviously more efficient to get more data out of each address query.

This doesn’t explain why G80 GPUs have a minimum transaction size of 64 bytes, while GT200 has a minimum size of 32. GDDR3 is used on G80 as well. Perhaps the larger size is just a limitation of the G80 memory controller design?

Thanks for the tech details, Sylvain!

This all makes sense and is what I expected. The programming guide adds some extra confusion by suggesting the access width (8-bit vs 16-bit vs 32-bit) makes a difference, even though that just doesn’t seem right, and testing shows it doesn’t make any difference.

Thanks for the confirmation!