I’m currently optimizing/restructuring some code which needs to access sparse, random, single words from large amounts of device memory… just about the worst case memory access pattern for CUDA! Unfortunately my code is memory bandwidth bottlenecked. This has led to some questions, not about my code, but just about CUDA and GPU memory access in general.
We know that current CUDA GPUs service global memory accesses per half-warp, in aligned segments. Compute capability 1.2 coalescing is much more versatile than on earlier devices: the threads of a half-warp no longer need to access addresses in sequence, as long as they all read from the same segment.
There’s a lot of detail in Chapter 5 of the programming guide. It documents that on compute capability 1.2+ devices (G200), you can get a transaction as small as 32 bytes, as long as each thread accesses only 8-bit words. If threads access 16-bit words, the transaction size is 64 bytes; with 32-bit words, the transaction size is 128 bytes.
These are all assuming the addresses you query are within the same memory segment.
But these word-size dependencies bother me. I’m not sure why the size of the word each thread accesses should make any difference. From the docs, it sounds like you’d get three different behaviors, as follows:
__global__ void kernel(char *deviceArray)
{
    char  c = deviceArray[0];             // reads via a 32 byte transaction
    short s = ((short *)deviceArray)[0];  // reads via a 64 byte transaction
    int   i = ((int *)deviceArray)[0];    // reads via a 128 byte transaction
}
If I run a speed test on a 280GTX with actual code, I don’t see any speed difference between the three cases.
Now the programming guide goes on to give “more precise” details which say that if the queried addresses all fall within one half of the segment, a 64 byte transaction may be reduced to 32 bytes, and a 128 byte transaction may be reduced to 64.
However my speed tests show that the 128 byte read of ints is indeed the same speed as chars, so it sounds like the 128 byte read is being reduced down to 32.
That behavior makes a lot of sense, but it’s clearly not what the programming guide says, either in its general description or in its “more precise” follow-up.
So it may just be the language in the programming guide… admittedly it’s trying to explain a set of hardware access rules using English and that’s always a challenge.
So three questions:
- Is the programming guide wrong? Is it possible to use 32 byte memory transactions even when reading/writing 32-bit ints, as long as they all fall within the same 32 byte segment? I think this is true.
- Out of curiosity, what part of the hardware limits the transaction size? (That is, why doesn’t it allow even tiny 4 byte transactions?) Is it because each memory chip holds one slice of the address space, so a transaction multiplexes the query across the RAM chips and pulls, say, 16 bits from each? Or is it a limitation of the GPU memory controller, which simplifies its transistor budget by dealing with fewer, larger chunks of memory? Or some other optimization?
- CPUs also access memory in transaction-sized chunks. I believe modern Intel Core CPUs use 64 bytes as well, since that’s their cache line size. Are those transaction sizes limited in the same way as on GPUs, by some multiplexed RAM chip query? Or what’s the tradeoff?
In my G300 wishlist, I’d like to put “allow smaller device memory transaction sizes”. G300 is taped out already, so please move my wish to G400’s wishlist if necessary. :-)