global memory latency

I have a question about latency in reading from global memory.

According to the Programming Guide, the following code will take 400
to 600 clock cycles to read from global memory:

__shared__ float shared[32];
__device__ float device[32];
shared[threadIdx.x] = device[threadIdx.x];

What if I change the code to:

__shared__ float shared[32];
__device__ float device[32];
if (threadIdx.x == 0) shared[threadIdx.x] = device[threadIdx.x];

In this case, only one thread will be performing a read. Will this read
still take 400 to 600 clock cycles, or will it be much faster? (I ask this
question because sometimes the very first or last thread in a block
may need an extra variable from global memory.)


The answer is no, it will not be much faster. In fact, the two access patterns should not differ much in the number of clock cycles they require.

The way to see this is the other way round: one memory read takes a long time, and coalesced memory reads (subsequent threads accessing subsequent locations in memory) add no significant latency on top of it. See the chapter on global memory access in the Programming Guide for more detail.
You could consider constant memory for those values (if they are read-only), or store them in shared memory if you computed them earlier in the same kernel/block.
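A minimal sketch of the shared-memory approach (the kernel and variable names are hypothetical, just for illustration): one thread fetches the extra value, and a barrier makes it visible to the whole block.

```cuda
__global__ void kernel(const float *in, float *out, int n)
{
    __shared__ float extra;   // one value shared by the whole block

    // A single thread pays the global-memory latency once...
    if (threadIdx.x == 0)
        extra = in[n - 1];

    // ...and the barrier makes the value visible to every thread.
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] + extra;
}
```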

Thanks for the reply. I guess I must have confused memory latency with memory bandwidth.

So, it seems that, for both the original code and my modified code with the ‘if’ condition, there will be the same memory latency of 400 to 600 clock cycles. However, the original code should still take much longer to complete, since it has as many reads as there are threads, whereas the modified code has only one read. Or am I wrong?

I cannot use constant memory for the extra variables, since they too will be modified. So, I guess I will have to assign an extra thread (tid = 0) to read in the extra variable (needed by tid = 1) from global memory and skip tid = 0 in the calculations.


It should, yes. But the hardware is able to optimize this memory access because the memory is accessed sequentially (coalesced memory access; see the chapter in the Programming Guide I referenced in my previous post).

Well, your original example only used 32 threads (as indicated by the array sizes). If you had run more than 32 threads it would have crashed. The way to think about coalescing is that it combines all accesses within a group of threads (a warp) into a single memory transaction (assuming the addresses satisfy coalescing conditions, which yours do).

On G8X coalescing happens at half-warp resolution, so your code without the if will take about twice as long to execute (two memory transactions) as the version with the if (one memory transaction).

Note that changing the if condition to if (threadIdx.x < 16) will likely result in the same performance as if (threadIdx.x == 0) in your example.
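Under that reading, a guard of threadIdx.x == 0 and a guard of threadIdx.x < 16 both touch only the first half-warp, so each should generate a single memory transaction. A sketch (the kernel name is just for illustration):

```cuda
__global__ void guarded_load(const float *device_in)
{
    __shared__ float shared[32];

    // Either guard restricts the load to the first half-warp, so on G8X
    // both versions should cost one memory transaction:
    if (threadIdx.x < 16)                 // or: if (threadIdx.x == 0)
        shared[threadIdx.x] = device_in[threadIdx.x];

    __syncthreads();
    // ... rest of the kernel uses shared[] ...
}
```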


This is where I am still confused. Let us go back to the original example (without the if).

Since there are 32 threads in this example, these will be grouped into 2 half-warps. In each half-warp, there will be 16 threads reading a total of 16 four-byte floats, or a total of 16 x 32 = 512 bits, which are all coalesced.

However, according to the Programming Guide, the device is capable of reading at most 128 bits in a single instruction. So, to read 512 bits, there will need to be 512/128 = 4 separate load instructions. My question now is: does each of these 4 load instructions require 400 to 600 clock cycles of latency, in which case there would be a total of 1600 to 2400 clock cycles of latency?

I must be wrong about how global memory latency works. Please enlighten me if you will.

This is a good question. I have been pondering over this as well. Somehow I assumed that there would be some streaming operation over the bus and the 4 floats would be pumped in fast, just like a load-multiple would work on PowerPC. My assumption…

OR perhaps

I think 16 8-bit loads may be coalesced into one 128-bit load instruction. Maybe that is why the half-warp size is 16… But I am surely missing something here. Global read coalescing should work for floats and ints as well…

Latency is not the same as throughput. If you have 400 cycles of latency (for all the loads):

clock 0: read in a float
clock 1: read in a float
…
clock 400: float has arrived
clock 401: float has arrived

So the latency of reading in, say, 16 floats is (counted from the moment the first read is issued) 416 cycles, and not 400 * 16.

8-bit memory reads (like char) are not coalesced.

This is only true if your data access pattern allows coalescing. It is much like a hard disk: a seek operation takes long, but reading a chunk of consecutive data is fast.
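One way to probe this pipelining behaviour yourself is a micro-benchmark with the device-side clock() counter. This is only a sketch under my own assumptions (16 independent reads per thread, data[] holding at least 16 * blockDim.x floats), and whether the compiler actually overlaps the loads is not guaranteed:

```cuda
// Each thread issues 16 independent global reads; if the loads overlap,
// the measured time should be near one latency (~400-600 cycles),
// not 16 latencies.
__global__ void latency_test(const float *data, float *out,
                             unsigned int *cycles)
{
    float sum = 0.0f;
    unsigned int start = clock();
    for (int i = 0; i < 16; ++i)
        sum += data[i * blockDim.x + threadIdx.x];
    unsigned int stop = clock();

    out[threadIdx.x] = sum;   // keep the reads from being optimized away
    if (threadIdx.x == 0)
        *cycles = stop - start;
}
```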

Let me see if I understand you. To simplify discussion, suppose there are only 2 threads. In the following code,

__shared__ float shared[2];
__device__ float device[2];
shared[threadIdx.x] = device[threadIdx.x];

the timing would, according to Mr. DenisR, look something like this:

clock 0: read in a float
clock 1: read in a float
clock 400: float (device[0]) has arrived
clock 401: float (device[1]) has arrived

Total time = 401 clock cycles.

But if I change the code to:

__shared__ float shared[2];
__device__ float device[9];
shared[threadIdx.x] = device[8*threadIdx.x];

now the stride is 8, and so the two reads are no longer from contiguous memory. The timing would now become something like:

clock 0: read in a float
… (latency)
clock 400: float (device[0]) has arrived
clock 401: read in a float
… (latency)
clock 801: float (device[8]) has arrived

Total time = 801 clock cycles.

Could this be right?

I remember reading someone from NVIDIA say that uncoalesced reads are serialized, so that would indeed mean that the total time is doubled. However, my memory is not that great.

But maybe it is easier to try than to guess? ;) You can always write a small test kernel to check these things out. But I think it is better to just keep everything nicely coalesced, and otherwise use textures: I have also read that reading data through a texture is faster than reading data from global memory uncoalesced.
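Such a test kernel could look something like this (a sketch; the stride parameter and the idea of timing it externally are my own assumptions, not from the thread):

```cuda
// Copy kernel whose access pattern is controlled by 'stride':
// stride == 1 gives coalesced reads, larger strides break coalescing.
__global__ void strided_copy(const float *in, float *out, int stride, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}
```

Timing the kernel once with stride = 1 and once with stride = 8 (for example with cudaEvent timers) should show the cost of the uncoalesced pattern directly.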

I will run some tests once I have finished building my new system.