memory latency

Is this per global memory call?

Meaning, if I make 3 or 5 calls in a row, will I have 300 cycles * 3 or * 5 of latency?

I was just wondering whether it is better to have all threads load a small amount from global memory, or to have just one thread load a bigger block for everybody?

It depends on how much can be done as 128-bit reads. See the manual section on how to specify alignment and on how accesses from multiple threads to the same memory location are coalesced.


If you read a lot of data from global memory with proper alignment, memory coalescing and so on… something like

shared[tid] = global[block*BLOCKS+tid];   // shared is a __shared__ float array

then you’ll get a read speed of about 65-70 GB/s (so, several clocks per float).

On the other hand, with very suboptimal global memory reads, i.e. in column order:

shared[tid] = global[tid*BLOCKS + block];

you’ll get read rates two orders of magnitude lower. If you need to read in column order, try to use texfetch(…)
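To make the difference between the two index patterns concrete, here is a rough host-side sketch in plain C++ (not a kernel) that counts how many aligned memory segments a group of threads touches. The group size of 16 threads and the 64-byte segment size are assumptions chosen for illustration, and the helper names are hypothetical; the real coalescing rules depend on the hardware, so check the manual.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Number of distinct aligned memory segments touched when `threads`
// consecutive threads each read one float at element index(tid).
template <typename IndexFn>
std::size_t segments_touched(int threads, std::size_t segment_bytes, IndexFn index) {
    assert(threads > 0 && segment_bytes > 0);
    std::set<std::size_t> segments;
    for (int tid = 0; tid < threads; ++tid) {
        const std::size_t byte_offset = index(tid) * sizeof(float);
        segments.insert(byte_offset / segment_bytes);
    }
    return segments.size();
}

// Row order, as in global[block*BLOCKS + tid].
// Assumed for illustration: 16 threads per group, 64-byte segments.
std::size_t row_order_segments(std::size_t block, std::size_t BLOCKS) {
    return segments_touched(16, 64, [=](int tid) { return block * BLOCKS + tid; });
}

// Column order, as in global[tid*BLOCKS + block].
std::size_t column_order_segments(std::size_t block, std::size_t BLOCKS) {
    return segments_touched(16, 64, [=](int tid) { return tid * BLOCKS + block; });
}
```

With block = 3 and BLOCKS = 1024, the row-order pattern keeps all 16 reads inside one segment, while the column-order pattern puts every read in a different segment. That is the kind of access the hardware cannot combine.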

I’ve done some experiments on big memory reads and documented them… in Russian :). You can try machine translation on my blog posts: …vanii_cuda.html …ja_tekstur.html

Nice! Any chance you could translate them into English? (I have a hard time fully understanding what the translation returns.)


– Nicolas

Nicolas, I can read English texts freely, but writing in English is too hard for me (I’ve had no practice since 1998).

Anyway, I can summarize my benchmark results:

  1. The fastest way to read from global memory is to read 4-byte (float) values, one per thread, with a 4-byte stride (see code in blog):
    for (rowN = bx; rowN < SIZE; rowN += blocks) {
        for (colN = tid; colN < SIZE; colN += threads) {
            sum += g_idata[rowN*SIZE + colN];  // consecutive tid values read consecutive floats
        }
    }

float4 fetches are two times slower; column-order fetches are two orders of magnitude slower (0.9 GB/s instead of 70).
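As a sanity check on the loop structure above, the following plain C++ sketch (host-side only, not the benchmark kernel from the blog) replays the two strided loops for every block and thread and counts how often each matrix element is read. The function name read_counts is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the two strided loops: block `bx` of `blocks` walks
// the rows, thread `tid` of `threads` walks the columns.
// Returns, for every element of the size x size matrix, how many times it is read.
std::vector<int> read_counts(int size, int blocks, int threads) {
    assert(size > 0 && blocks > 0 && threads > 0);
    std::vector<int> counts(static_cast<std::size_t>(size) * size, 0);
    for (int bx = 0; bx < blocks; ++bx)
        for (int tid = 0; tid < threads; ++tid)
            for (int rowN = bx; rowN < size; rowN += blocks)
                for (int colN = tid; colN < size; colN += threads)
                    ++counts[static_cast<std::size_t>(rowN) * size + colN];
    return counts;
}
```

Every element is read exactly once, even when the matrix size is not a multiple of the block or thread count. Within one loop iteration, neighbouring threads read neighbouring floats of the same row, which is the 4-byte stride mentioned above.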

  2. Read speed depends on your grid setup:

A. You need many thread blocks (CTAs) because of global memory alignment (in my sample, the read offsets vary as gridDim.x * matrix row size). 1024 blocks is a good initial value.

B. You need many threads per CTA. The thread count SHOULD be a multiple of 32 (192, 256 or 320 are good starting values to try).

C. The optimal thread count per CTA depends on the kernel's register usage (reg=NN in the .cubin file). Each multiprocessor has a 32 KB register file (so 8192 floats). For a kernel that uses 12 registers you can run 682 threads on a multiprocessor (the hardwired maximum is 786). So you cannot run three CTAs of 256 threads in parallel, but two CTAs of 320 threads can be executed.
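The register arithmetic in point C can be written out as a small plain C++ sketch. It only models the register budget: it assumes 4-byte registers and ignores the hardwired thread limit, shared memory, and any allocation granularity the hardware may apply. The function names are hypothetical.

```cpp
#include <cassert>

// Threads that fit in one multiprocessor's register file.
int threads_by_registers(int register_file_bytes, int regs_per_thread) {
    assert(register_file_bytes > 0 && regs_per_thread > 0);
    const int total_registers = register_file_bytes / 4;  // assumes 4-byte registers
    return total_registers / regs_per_thread;             // integer division rounds down
}

// Whole CTAs of `cta_threads` threads that fit under that register budget.
int resident_ctas(int register_file_bytes, int regs_per_thread, int cta_threads) {
    assert(cta_threads > 0);
    return threads_by_registers(register_file_bytes, regs_per_thread) / cta_threads;
}
```

For a 32 KB register file and 12 registers per thread this gives 682 threads. Both 256-thread and 320-thread CTAs fit twice, but two CTAs of 320 keep 640 threads resident while two CTAs of 256 keep only 512.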

Sorry for my broken (Russian-like) English. Feel free to ask.

Hey Alex,

Very nice work! I was able to reproduce some of the graphs. But I cannot see where you get the hardwired 786 threads per multiprocessor from?