global memory latency

I have a question about latency in reading from global memory.

From 5.1.1.3 of the Programming Guide, the following code will take 400
to 600 clock cycles to read from global memory:

__shared__ float shared[32];
__device__ float device[32];
shared[threadIdx.x] = device[threadIdx.x];

What if I change the code to:

__shared__ float shared[32];
__device__ float device[32];
if (threadIdx.x == 0) shared[threadIdx.x] = device[threadIdx.x];

In this case, only one thread will be performing a read. Will this read
still take 400 to 600 clock cycles, or will it be much faster? (I ask this
question because sometimes the very first or last thread in a block
may need an extra variable from global memory.)

Thanks.

The answer is no - it will not be much faster. In fact, the two access patterns should not differ much in the number of clock cycles they require.

The way to see this is the other way round: a single memory read takes a long time, and coalesced memory reads (consecutive threads accessing consecutive memory locations) add no significant latency on top of that.
Consider reading Section 5.1.2.1 of the Programming Guide for more information on global memory access.
You could use constant memory for those values (if they are read-only), or store them in shared memory if you computed them earlier in the same kernel/block.
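For what it's worth, a minimal sketch of the constant-memory route might look like this (the names extraVals, useExtras, and hostVals are made up for illustration):

__constant__ float extraVals[2];   // hypothetical read-only values

__global__ void useExtras(float *out)
{
    // constant memory is cached, so a broadcast read by many threads is cheap
    out[threadIdx.x] = extraVals[0] + extraVals[1] * threadIdx.x;
}

// host side, before the launch (hostVals is a hypothetical float[2] on the CPU):
// cudaMemcpyToSymbol(extraVals, hostVals, sizeof(extraVals));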

Thanks for the reply. I guess I must have confused memory latency with memory bandwidth.

So, it seems that, for both the original code in 5.1.1.3 and my modified code with the 'if' condition, there will be the same memory latency of 400 to 600 clock cycles. However, the original code should still take much longer to complete, since it has as many reads as there are threads, whereas the modified code has only one read. Or am I wrong?

I cannot use constant memory for the extra variables, since they too will be modified. So, I guess I will have to assign an extra thread (tid = 0) to read in the extra variable (needed by tid = 1) from global memory and skip tid = 0 in the calculations.
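Something like this rough sketch (the array and kernel names are made up) captures that plan: every thread stages one element into shared memory, so tid = 0's load doubles as the extra value tid = 1 needs, and tid = 0 then sits out the arithmetic:

__device__ float data[32];              // hypothetical global array

__global__ void staged(void)
{
    __shared__ float s[32];
    s[threadIdx.x] = data[threadIdx.x]; // coalesced copy; s[0] is the extra value
    __syncthreads();                    // make the staged values visible to all
    if (threadIdx.x > 0) {
        // tid = 0 is skipped; tid = 1 can now use s[0] from shared memory
        // ... compute with s[threadIdx.x] and s[threadIdx.x - 1] ...
    }
}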

Michael

It should, yes. But CUDA is able to optimize this memory access because the memory is accessed sequentially (coalesced memory access; I referenced the relevant section of the Programming Guide in my previous post).

Well, your original example only used 32 threads (as indicated by the array sizes). If you had run more than 32 threads it would have crashed. The way to think about coalescing is that it combines all accesses within a group of threads (a warp) into a single memory transaction (assuming the addresses satisfy coalescing conditions, which yours do).

On G8X coalescing happens at half-warp resolution, so your code without the if will take about twice as long to execute (two memory transactions) as the version with the if (one memory transaction).

Note that changing the if condition to if (threadIdx.x < 16) will likely result in the same performance as if (threadIdx.x == 0) in your example.
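To spell the variants out (a sketch; g_data is a stand-in name for the global array):

__device__ float g_data[32];

__global__ void loadVariants(void)
{
    __shared__ float s[32];

    s[threadIdx.x] = g_data[threadIdx.x];       // 32 threads = 2 half-warps: 2 transactions

    if (threadIdx.x < 16)
        s[threadIdx.x] = g_data[threadIdx.x];   // 1 half-warp: 1 transaction

    if (threadIdx.x == 0)
        s[threadIdx.x] = g_data[threadIdx.x];   // 1 thread: still 1 transaction
}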

Mark

This is where I am still confused. Let us go back to the original example (without the if). Since there are 32 threads in this example, they will be grouped into 2 half-warps. In each half-warp, there will be 16 threads reading a total of 16 four-byte floats, or 16 x 32 = 512 bits, all of which are coalesced.

However, from 5.1.2.1 of the Programming Guide, the device is capable of reading at most 128 bits in a single instruction. So, to read 512 bits, there will need to be 512/128 = 4 separate load instructions. My question now is: does each of these 4 load instructions require 400 to 600 clock cycles of latency, in which case there would be a total of 1600 to 2400 clock cycles of latency?

I must be wrong about how global memory latency works. Please enlighten me if you will.

This is a good question. I have been pondering this as well. Somehow I assumed that there would be some streaming operation over the bus and the 4 floats would be pumped in fast (just like how a load-multiple works on PowerPC)... My assumption...

OR perhaps

I think 16 8-bit loads may be coalesced into one 128-bit load instruction. Maybe that's why the half-warp size is 16... But I am surely missing something here. The global read coalescing should work for floats and ints as well...

Latency is not the same as throughput. If you have 400 cycles of latency (for all the loads):

clock = 0: read in a float
clock = 1: read in a float
...
clock = 400: float has arrived
clock = 401: float has arrived

So the latency of reading in, say, 16 floats is 416 cycles (measured from the moment the first read is issued), not 400 x 16.

8-bit memory reads (like char) are not coalesced.

This is only true if your access pattern is coalesced. It is like a hard disk: a seek takes a long time, but reading a chunk of consecutive data is fast.

Let me see if I understand you. To simplify the discussion, suppose there are only 2 threads. In the following code,

__shared__ float shared[2];
__device__ float device[2];
shared[threadIdx.x] = device[threadIdx.x];

the timing would, according to Mr. DenisR, look something like this:

clock = 0: read in a float
clock = 1: read in a float
... (latency)
clock = 400: float (device[0]) has arrived
clock = 401: float (device[1]) has arrived

Total time = 401 clock cycles.

But if I change the code to:

__shared__ float shared[2];
__device__ float device[9];
shared[threadIdx.x] = device[8*threadIdx.x];

now the stride is 8, so the two reads are no longer from contiguous memory. The timing would become something like:

clock = 0: read in a float
... (latency)
clock = 400: float (device[0]) has arrived
clock = 401: read in a float
... (latency)
clock = 801: float (device[8]) has arrived

Total time = 801 clock cycles.

Could this be right?

I remember reading someone from NVIDIA say that uncoalesced reads are serialized, so that would indeed mean the total time is doubled. However, my memory is not that reliable.

But maybe it is easier to try than to guess? ;) You can always write a small test kernel to check these things out. But I think it is better just to keep everything nice and coalesced. Otherwise, use textures: I have also read that reading data through a texture is faster than reading from global memory uncoalesced.
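As a starting point for such a test, here is a rough sketch (all names are made up; it reads the per-multiprocessor cycle counter with clock(), so treat the numbers as indicative rather than exact):

__global__ void timedRead(const float *in, float *out, int stride,
                          clock_t *elapsed)
{
    __shared__ float s[32];
    clock_t start = clock();
    s[threadIdx.x] = in[threadIdx.x * stride];  // stride 1 = coalesced, 8 = strided
    __syncthreads();                            // wait until every load has landed
    clock_t stop = clock();
    if (threadIdx.x == 0)
        *elapsed = stop - start;
    out[threadIdx.x] = s[threadIdx.x];          // keep the load from being optimized away
}

// launch with a single 32-thread block; in must hold at least 32*stride floats:
// timedRead<<<1, 32>>>(d_in, d_out, 1, d_elapsed);   // coalesced case
// timedRead<<<1, 32>>>(d_in, d_out, 8, d_elapsed);   // strided case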

I will run some tests once I have finished building my new system.