memory latency

benoit · March 15, 2007, 4:02pm

Is this per global memory access?

Meaning, if I issue 3 or 5 accesses in a row, will I have 300 cycles * 3 or * 5 of latency?

I was just wondering if it is better to have all threads each load a small amount from global memory,

or to have just one thread load a bigger block for everybody?

prkipfer · March 15, 2007, 4:29pm

Depends on how much can be done as 128-bit reads. See manual section 6.1.2.1 for how to specify alignment and how accesses from multiple threads to the same memory location are coalesced.

Peter

AlexTutubalin · March 15, 2007, 6:44pm

If you read a lot of data from global memory with proper alignment, memory coalescing and so on, with something like

shared[tid] = global[block*BLOCKS+tid];   // shared[] is a __shared__ float array

then you’ll get a read speed of about 65-70 GB/s (so, several clocks per float).

On the other hand, with a very suboptimal global memory access pattern, e.g. reading in column order:

shared[tid] = global[tid*BLOCKS + block];

you’ll get read rates two orders of magnitude lower. If you need to read in column order, try using texfetch(…)
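One way to see why the two index expressions above behave so differently is to count how many aligned memory segments a group of neighbouring threads touches. The host-side C++ sketch below does only that arithmetic and does not run on the GPU. The group size of 16 threads and the 64-byte segment size are assumptions chosen for illustration, and `row_index`, `col_index` and `segments_touched` are hypothetical helper names that do not appear in the posts.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumptions for illustration only (not taken from the thread):
// how many neighbouring threads are serviced together, and the size
// of one aligned memory segment in bytes.
constexpr std::size_t kGroupThreads = 16;
constexpr std::size_t kSegmentBytes = 64;

// Index used by the row-order read: global[block*BLOCKS + tid]
std::size_t row_index(std::size_t block, std::size_t tid, std::size_t blocks) {
    return block * blocks + tid;
}

// Index used by the column-order read: global[tid*BLOCKS + block]
std::size_t col_index(std::size_t block, std::size_t tid, std::size_t blocks) {
    return tid * blocks + block;
}

// Count the distinct aligned segments touched by threads
// 0..kGroupThreads-1 of one block when each reads one float.
template <typename IndexFn>
std::size_t segments_touched(IndexFn index, std::size_t block, std::size_t blocks) {
    assert(blocks > 0);
    std::set<std::size_t> segments;
    for (std::size_t tid = 0; tid < kGroupThreads; ++tid) {
        const std::size_t byte_address = index(block, tid, blocks) * sizeof(float);
        segments.insert(byte_address / kSegmentBytes);
    }
    return segments.size();
}
```

Calling `segments_touched(row_index, 0, 1024)` and `segments_touched(col_index, 0, 1024)` compares the two patterns for block 0 with BLOCKS = 1024: in row order the threads share one segment, while in column order every thread lands in a segment of its own, so the hardware has far more separate memory traffic to service for the same amount of useful data.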

I’ve done some experiments on big memory reads and documented them… in Russian :). You can try to use www.translate.ru to translate my blog posts:

http://blog.lexa.ru/2007/03/08/nvidia_8800gtx_propusknaja_sposobnost__pamjati_pri_ispol_zovanii_cuda.html
NVidia 8800GTX: скорость чтения текстур [texture read speed] | blog.lexa.ru: http://blog.lexa.ru/2007/03/08/nvidia_8800...ja_tekstur.html

e.ping · March 16, 2007, 2:22pm

Nice! Any chance you could translate them into English (I have a hard time fully understanding what translate.ru gives me)?

Thx

– Nicolas

AlexTutubalin · March 17, 2007, 7:01am

Nicolas, I can read English texts freely, but writing in English is too hard for me (I’ve had no practice since 1998).

Anyway, I can summarize my benchmark results:

The fastest way to read from global memory is to read 4-byte (float) values one per thread with 4-byte stride (see code in blog):

for(rowN=bx; rowN < SIZE; rowN+=blocks){
    for(colN=tid; colN < SIZE; colN+=threads){
        sum += g_idata[rowN*SIZE+colN];
    }
}

float4 fetches are two times slower, and column-order fetches are two orders of magnitude slower (0.9 GB/s instead of 70).
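To make the work distribution of the loop above easier to check, here is a small host-side C++ sketch that replays it for every (block, thread) pair and counts how often each matrix element is read. It is only an emulation of the index arithmetic, not the benchmark kernel itself; `count_reads` and `every_element_read_once` are hypothetical names, and `size`, `blocks` and `threads` stand in for SIZE, the number of blocks and the number of threads per block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Replay the strided loops for all blocks and threads on the host
// and count how many times each element of the size x size matrix is read.
std::vector<int> count_reads(std::size_t size, std::size_t blocks, std::size_t threads) {
    assert(blocks > 0 && threads > 0);
    std::vector<int> reads(size * size, 0);
    for (std::size_t bx = 0; bx < blocks; ++bx) {
        for (std::size_t tid = 0; tid < threads; ++tid) {
            // Same loop structure as in the post: each block strides over rows,
            // each thread strides over columns of the row.
            for (std::size_t rowN = bx; rowN < size; rowN += blocks) {
                for (std::size_t colN = tid; colN < size; colN += threads) {
                    ++reads[rowN * size + colN];
                }
            }
        }
    }
    return reads;
}

// True if every element was read exactly once.
bool every_element_read_once(const std::vector<int>& reads) {
    return std::all_of(reads.begin(), reads.end(), [](int n) { return n == 1; });
}
```

Within one row, threads tid and tid+1 read adjacent floats, which is the 4-byte stride mentioned above. The loop bounds also handle sizes that are not a multiple of the block or thread count.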

Read speed depends on your grid setup:

A. You need many thread blocks (CTAs), because of global memory alignment (in my sample, read offsets vary as gridDim.x * matrix row size). 1024 blocks is a good initial value.

B. You need many threads per CTA. The thread count should be a multiple of 32 (192, 256 or 320 are good starting values to try).

C. The optimal thread count per CTA depends on per-thread register usage (reg=NN in the .cubin file). Each multiprocessor has a 32 KB register file (so 8192 floats). For a kernel that uses 12 registers you can run 682 threads per multiprocessor (hardwired maximum is 786). So you cannot run three CTAs of 256 threads in parallel, but two CTAs of 320 threads can be executed.
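The register arithmetic in point C can be written out as a short host-side C++ sketch. It uses the figures from the post (8192 registers per multiprocessor, 12 registers per thread) and ignores every other limit, such as shared memory. The function names are hypothetical. The thread cap is an assumption: the post writes 786, while NVIDIA's documentation for this hardware generation lists 768 resident threads per multiprocessor, which is what the sketch uses. With 12 registers per thread the register limit of 682 is the binding one either way.

```cpp
#include <algorithm>
#include <cassert>

// From the post: a 32 KB register file holds 8192 32-bit registers.
constexpr int kRegistersPerMultiprocessor = 8192;
// Assumption: cap on resident threads per multiprocessor (the post writes 786).
constexpr int kMaxThreadsPerMultiprocessor = 768;

// How many threads fit if registers are the only constraint.
int max_threads_by_registers(int regs_per_thread) {
    assert(regs_per_thread > 0);
    return kRegistersPerMultiprocessor / regs_per_thread;
}

// How many whole CTAs of a given size fit under both limits.
int resident_ctas(int threads_per_cta, int regs_per_thread) {
    assert(threads_per_cta > 0);
    const int thread_budget = std::min(max_threads_by_registers(regs_per_thread),
                                       kMaxThreadsPerMultiprocessor);
    return thread_budget / threads_per_cta;
}

// Total threads resident on the multiprocessor for that configuration.
int resident_threads(int threads_per_cta, int regs_per_thread) {
    return resident_ctas(threads_per_cta, regs_per_thread) * threads_per_cta;
}
```

For 12 registers per thread, both 256-thread and 320-thread CTAs fit twice, but that means 512 resident threads in the first case and 640 in the second, which is why the larger CTA makes better use of the register file here.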

Sorry for my broken (Russian-alike) English. Feel free to ask.

prkipfer · March 21, 2007, 1:41pm

Hey Alex,

Very nice work! I was able to reproduce some of the graphs. But I cannot see where you get the hardwired 786 threads per multiprocessor from.

Peter
