A single memory transaction can be either 32 bytes (compute capability >= 1.2), 64 bytes or 128 bytes. Does this mean that 4 such 32 byte transactions (or 2 64 byte transactions) can be performed in parallel from different locations in global memory? Or do they have to be consecutive memory address ranges?
A single transaction covers one consecutive memory address range.
I’m not sure I understand your question, but yes, most NVIDIA GPUs have multiple memory channels and can service transactions from multiple locations in parallel. This doesn’t always happen, of course: there are certain rules, and the benefits aren’t straightforward either. In general, if you can issue a single 128-byte transaction, do that.
Sorry for the late reply. I was basically wondering how int8 structs (like the built-in vector types, aligned to 16 bytes, but with 8 integers = 32 bytes) read by each thread from different locations in global memory perform. Can a GT200 series card read 4 such 32 byte structs in parallel, even though the structs are not stored in consecutive memory address ranges?
Thanks.
The short answer is no.
The long answer is that at the assembly level, a thread can read values that are 1, 2, 4, 8, or 16 bytes in size. Everything else gets compiled into several instructions; you can examine the PTX code to see what happens. Coalescing requirements apply to a single instruction (refer to the programming guide for details); there is no coalescing across different instructions.
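To make the instruction count concrete, here is a rough host-side C++ sketch (not device code). `Int4` and `Int8` are hypothetical stand-ins for CUDA's built-in `int4` and for the 8-integer struct from the question, and `loads_needed` is a made-up helper that assumes the compiler always picks the widest 16-byte load:

```cpp
#include <cstddef>

// Host-side stand-ins for illustration only. In CUDA, int4 is the real
// built-in 16-byte vector type; Int8 is a hypothetical user-defined struct.
struct alignas(16) Int4 { int x, y, z, w; };
struct alignas(16) Int8 { int v[8]; };

// Widest single load a thread can issue, as stated in the post above.
constexpr std::size_t kMaxLoadBytes = 16;

// Minimum number of load instructions needed to read `bytes` bytes,
// assuming every load is the widest (16-byte) one.
constexpr std::size_t loads_needed(std::size_t bytes) {
    return (bytes + kMaxLoadBytes - 1) / kMaxLoadBytes;
}

static_assert(sizeof(Int4) == 16, "assumes a 4-byte int");
static_assert(loads_needed(sizeof(Int4)) == 1, "an int4 fits in one load");
static_assert(loads_needed(sizeof(Int8)) == 2, "8 ints need two loads");
```

So the 32-byte struct becomes at least two separate loads per thread, and each of those loads is coalesced (or not) on its own.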
Paulius
Oh yeah of course, I forgot that a thread can only read a maximum of 16 bytes into registers in a single instruction. So let’s say I’m using int4 structs, can 4 such structs be read from random addresses in parallel, each in one 32 byte transaction, thus using 50% of the available bandwidth?
Thanks a lot for your replies.
Yes, if threads read random int4s, they will cause 16 byte reads that are upgraded to 32 byte transactions.
The amount of theoretical bandwidth that will be used is a trickier question. Besides wasting 16 bytes, you also incur the overhead of the transaction itself (charging up the DRAM, addressing it, etc.). You also have to consider how your accesses are distributed among the available memory channels. But in short, yeah, something like 25-50%, in the best case. Uncoalesced int4 reads are much more efficient than uncoalesced int reads.
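As a back-of-the-envelope check on the int4 versus int comparison, here is a small C++ sketch. It assumes every uncoalesced read gets its own minimum-size 32-byte transaction and ignores the DRAM overhead and channel distribution mentioned above, so the numbers are upper bounds only. `useful_percent` is a hypothetical helper, not a CUDA API:

```cpp
// Smallest transaction size discussed in this thread
// (compute capability >= 1.2).
constexpr int kMinTransactionBytes = 32;

// Percentage of the transferred bytes that a thread actually uses when its
// read is uncoalesced and is served by one minimum-size transaction.
// Only meaningful for single loads, i.e. bytes_per_thread <= 16.
constexpr double useful_percent(int bytes_per_thread) {
    return 100.0 * bytes_per_thread / kMinTransactionBytes;
}

static_assert(useful_percent(16) == 50.0, "int4: half of each transaction is used");
static_assert(useful_percent(4) == 12.5, "int: one eighth of each transaction is used");
```

Under these assumptions a random int4 read uses four times as much of each transaction as a random int read, which is where the 50% ceiling comes from.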