question about latency of global memory

Dear all,
Section 5.1.1.3 of the programming guide says “Throughput of memory operations is 8 operations per clock cycle.
When accessing local or global memory, there are, in addition, 400 to 600 clock cycles of memory latency.”

So the latency of global memory seems to be 400~600 core cycles.

But I wonder how this number is obtained.

I estimated the memory latency in the thread http://forums.nvidia.com/index.php?showtop…st=#entry600634

and obtained about 110 core cycles.

In that post, I use DDR2 to model GDDR3 and take the generic DRAM model from the book “Memory Systems: Cache, DRAM, Disk”, see figure 1.

figure 1: DRAM model

The Gantt chart of a complete read cycle is shown in figure 2; it consists of row access, column read, data restore and pre-charge.
figure 2:

Also, in the thesis of Wilson Wai Lun Fung (https://circle.ubc.ca/bitstream/2429/2268/1/ubc_2008_fall_fung_wilson_wai_lun.pdf ),
the author adopts the GDDR3 timing parameters provided by Qimonda for their 512-Mbit GDDR3 Graphics RAM clocked at 650MHz.
We adopt the same timing parameters except for the frequency (Tesla C1060 is 800MHz). The parameters are summarized in table 3.

From the above data, we can estimate the duration of a complete read cycle.

read cycle = tRC = tRAS + tRP = 21 + 13 = 34 (memory cycles)

The core frequency of Tesla C1060 is 1.3GHz and the memory clock is 400MHz, so

read cycle = 34 (memory cycles) x 1.3GHz/400MHz = 110.5 (core cycles)

This number is far from 400~600 cycles.
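The unit conversion above can be checked with a few lines of Python. This is only a sketch of the arithmetic in this post: the function name `dram_read_cycle_in_core_clocks` is made up for illustration, and the inputs (tRAS = 21, tRP = 13, 1.3GHz core clock, 400MHz memory clock) are the values quoted above, not measurements.

```python
def dram_read_cycle_in_core_clocks(t_ras, t_rp, core_hz, mem_hz):
    """Convert one DRAM read cycle (tRC = tRAS + tRP) from memory clocks to core clocks."""
    t_rc = t_ras + t_rp  # tRC in memory clock cycles
    return t_rc * core_hz / mem_hz


# Values quoted in the post (assumed, not measured).
core_cycles = dram_read_cycle_in_core_clocks(21, 13, 1.3e9, 400e6)
print(f"tRC = {core_cycles:.1f} core cycles")
```

Changing `mem_hz` shows how sensitive the estimate is to the memory clock that is assumed.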

Second, global memory is divided into either 7 partitions (GTX260, GTX295) or 8 partitions (Tesla C1060) of 256-byte width,

and the partition camping problem occurs: for example, if all SMs access the same partition, then bandwidth is limited to 1/8 or 1/7

of the maximum bandwidth. This leads to the question:

Is the memory interface also divided into 7 or 8 partitions? Say,
GTX295 has a 448-bit interface per GPU, 448 = 64 * 7, so each partition uses a 64-bit interface.
Tesla C1060 has a 512-bit interface, 512 = 64 * 8, so each partition uses a 64-bit interface.

If so, this would explain

“partition camping problem occurs, for example, all SMs access the same partition, then bandwidth is limited to 1/8 or 1/7 of maximum bandwidth”
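To make the camping scenario concrete, here is a small Python model of which partition an address falls into. It is a sketch under a loud assumption: consecutive 256-byte blocks are assigned to partitions round-robin (a plain modulo). The real address mapping in the hardware may differ, and the names `partition_of` and `partitions_touched` are invented for this example.

```python
PARTITION_WIDTH_BYTES = 256  # partition width quoted above


def partition_of(byte_address, num_partitions):
    """Assumed round-robin mapping: block k of 256 bytes goes to partition k mod N."""
    return (byte_address // PARTITION_WIDTH_BYTES) % num_partitions


def partitions_touched(addresses, num_partitions):
    """Set of distinct partitions hit by a collection of byte addresses."""
    return {partition_of(a, num_partitions) for a in addresses}


# 8 partitions (Tesla C1060): a stride of 8 * 256 = 2048 bytes always hits partition 0,
# while a stride of 256 bytes spreads the accesses over all 8 partitions.
camped = partitions_touched(range(0, 64 * 2048, 2048), 8)
spread = partitions_touched(range(0, 64 * 256, 256), 8)
print(len(camped), "of 8 partitions used with a 2048-byte stride")
print(len(spread), "of 8 partitions used with a 256-byte stride")
```

Under this model, the camped pattern can use at most one partition's share of the bandwidth, which is where the 1/8 (or 1/7) figure comes from.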

Yes, the DRAM read cycle is only a very small part of the total latency.

The total latency includes:

  • Virtual address calculation. I suspect that even global memory reads have to take the same path as texture reads, instead of a shortcut.

  • On-chip crossbar interconnect traversal.

  • Virtual to physical address translation.

  • Physical to raw address translation (includes a division/modulo to accommodate non-power-of-two numbers of partitions).

  • Reordering from a deep buffer. Memory controllers aggressively reorder accesses to minimize DRAM page switching and read/write turnaround overheads, trading latency for throughput.

  • The DRAM read cycle itself.

  • Going back through the interconnect.

  • Going through texture filtering units, if there is no shortcut datapath.

We made some attempts at measuring memory latency in the forums: Topic 80451.

Other measurements from the literature:

http://mc.stanford.edu/cgi-bin/images/6/65…_Volkov_GPU.pdf

http://www.eecg.toronto.edu/~moshovos/CUDA…mark_report.pdf

Thanks to @Sylvain Collange for the comment. I have one question and several observations.

Question: in the thread http://forums.nvidia.com/index.php?showtopic=80451

@Sylvain Collange reports

On a 9800GX2, when varying the data size, keeping a stride of 4K :

- from 4K to 64K : 320 ns

- from 128K to 8MB : 350 ns

- 16MB and more : 500 ns

However, on a Tesla C1060 I obtain a latency of about 500 cycles with a stride of 4K

when using cuda_latency.tar.gz provided by @Sylvain Collange in the thread

http://forums.nvidia.com/index.php?showtop…rt=#entry468968

Table 1: keeping stride = 4KB and sweeping the data size from 4KB to 64MB, the latency is about 500 core cycles.

% stream_test( void ) : Stream reads
% use device 2, name = Tesla C1060
% data_size_min = 4.00 kB
% stride_min = 4096 byte
% data_size_max = 77056.00 kB
% stride_max = 4096 byte
% runs  stride  size(kB)  clocks  ns
  5     4096    4.00      506     409.40
  5     4096    8.00      508     411.02
  5     4096    16.00     506     409.40
  5     4096    32.00     506     409.40
  5     4096    64.00     506     409.40
  5     4096    128.0     506     409.40
  5     4096    256.0     504     407.78
  5     4096    512.0     506     409.40
  5     4096    1024.0    506     409.40
  5     4096    2048.0    506     409.40
  5     4096    4096.0    498     402.93
  5     4096    8192.0    506     409.40
  5     4096    16384.0   504     407.78
  5     4096    32768.0   504     407.78
  5     4096    65536.0   504     407.78

Am I missing something? The latency stays at about 500 cycles regardless of the data size.
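Since the 9800GX2 figures are given in ns and the table reports both clocks and ns, a small conversion helper makes the two easier to compare. The sketch below only uses rows copied from the table; the helper names are made up for illustration, and which clock the tool actually uses for its conversion is not stated in this thread.

```python
# (clocks, ns) pairs copied from the table above
samples = [(506, 409.40), (508, 411.02), (504, 407.78), (498, 402.93)]


def implied_clock_ghz(clocks, ns):
    """Clock frequency (GHz) at which `clocks` cycles take `ns` nanoseconds."""
    return clocks / ns


def cycles_to_ns(clocks, clock_ghz):
    """Duration in ns of `clocks` cycles at `clock_ghz` GHz."""
    return clocks / clock_ghz


for clocks, ns in samples:
    print(f"{clocks} clocks / {ns} ns -> {implied_clock_ghz(clocks, ns):.3f} GHz")
```

The two columns of the table correspond to a clock of roughly 1.236GHz, so about 500 clocks is a little over 400 ns, the same order of magnitude as the 320 to 500 ns reported for the 9800GX2.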

Observation: what I am concerned with is throughput. When I use a Gantt chart to analyze the bandwidth difference between “float” and “double”,

I am unsure how to incorporate the idea of “partition camping” into the Gantt chart. In other words, I want to ask:

Is the memory interface also divided into 7 or 8 partitions? Say,

GTX295 has a 448-bit interface per GPU, 448 = 64 * 7, so each partition uses a 64-bit interface.

Tesla C1060 has a 512-bit interface, 512 = 64 * 8, so each partition uses a 64-bit interface.

I think that this question is answered in the thread http://forums.nvidia.com/index.php?showtop…rt=#entry457188

@alex dubinsky said

"Btw, there is an alternate explanation for variable latencies. The DRAM is organized into channels, and depending on how you access the channels (for example, sending all accesses to one or spreading them out) will affect performance by a large amount."

I think that the “channel” is the “partition” mentioned in SDK/transposeNew/doc/MatrixTranspose.pdf.

Searching for “channel GDDR3” on Google turns up the specification of an ATI product. The white paper of the Radeon X1800,

http://ati.amd.com/products/radeonx1k/whit…_Whitepaper.pdf, shows

(1) The Radeon X1800 Memory Controller takes advantage of this fact by dividing its 256-bit memory interface into eight 32-bit channels, see figure 1

A key issue with moving to a wider memory interface relates to the concept of granularity. For maximum efficiency, every wire in the interface should ideally be carrying data every clock cycle. For example, if a request for 32 bits of data was made on a 256-bit interface, it could mean that most of the wires would not be carrying any data when the request was fulfilled.

GPUs typically address this granularity issue by dividing their memory interfaces into multiple channels. Each channel can serve one read or write request at a time, so an interface with multiple channels can serve multiple requests simultaneously.

(2) Ring Bus architecture of Radeon X1800, see figure 2

figure 1,

figure 2,
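The granularity argument quoted from the Radeon X1800 white paper can be put into numbers. This is a toy model only, assuming a single request occupies one interface (or one channel) for one transfer; `wire_utilization` is a name invented for this sketch.

```python
def wire_utilization(request_bits, interface_bits):
    """Fraction of the wires carrying data while one request is served (capped at 1.0)."""
    return min(request_bits, interface_bits) / interface_bits


# One 32-bit request on a monolithic 256-bit interface.
print(wire_utilization(32, 256))
# The same request on one of eight independent 32-bit channels.
print(wire_utilization(32, 32))
```

With independent channels, the other seven channels remain free to serve other requests at the same time, which is the point the white paper makes.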

Also, from http://pc.watch.impress.co.jp/docs/2008/0617/kaigai446.htm by Hiroshige Goto (the link was provided by @Sylvain Collange),

an overview of GT200 is shown in figure 3. It shows that GT200 has 8 channels with a 64-bit interface per channel and uses 32-bit memory devices;

for example, Tesla C1060 uses 32 32Mx32 GDDR3 SDRAM chips.

figure 3,

Conclusion:

When one wants to improve the performance of a memory-bound problem, like the matrix transpose we discussed in the thread

http://forums.nvidia.com/index.php?showtopic=106924 , partition camping needs to be solved first (that is, concurrent accesses to global memory by all

active warps should be divided evenly amongst partitions (channels)). If all SMs access the same channel, then the effective interface is 64-bit, not 512-bit,

and the effective bandwidth is only 1/8 of the maximum bandwidth, independent of how many cores you are using.
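The 1/8 and 1/7 figures follow directly from the per-channel width. The sketch below assumes that all channels are equally fast and fully independent, and it ignores every other effect (latency, reordering, bank conflicts); the function names are illustrative only.

```python
CHANNEL_BITS = 64  # per-channel interface width discussed above


def effective_interface_bits(channels_in_use):
    """Interface width that is actually exercised when only some channels are accessed."""
    return CHANNEL_BITS * channels_in_use


def bandwidth_fraction(channels_in_use, total_channels):
    """Fraction of peak bandwidth reachable when only `channels_in_use` channels get requests."""
    return effective_interface_bits(channels_in_use) / effective_interface_bits(total_channels)


for name, total in (("Tesla C1060", 8), ("GTX295 (per GPU)", 7)):
    print(name, effective_interface_bits(total), "bit total,",
          f"{bandwidth_fraction(1, total):.3f} of peak when camping on one channel")
```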

Of course, “each channel can only access 64-bit interface” means that my Gantt chart in the thread

http://forums.nvidia.com/index.php?showtop…rt=#entry601970

is wrong.

Moreover, if the latency is 500 cycles and the "read cycle of DRAM" is only 110 cycles, then the fixed cost of accessing DRAM is about 400 cycles

(fixed cost = Virtual address calculation + On-chip crossbar interconnect traversal + Virtual to physical address translation + Physical to raw address translation).

This means that when drawing a Gantt chart, one can ignore the variation due to bank conflicts, accesses to different rows of the same bank, etc.
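As a rough check of that split, the fixed cost can be computed from the two numbers used in this thread. This is plain subtraction, a sketch only: 506 clocks is a typical row of the stride-4KB table and 110.5 core cycles is the tRC estimate from the first post, so both inputs carry the assumptions made there.

```python
def fixed_cost_cycles(total_latency_cycles, dram_read_cycles):
    """Latency left over once the DRAM read cycle itself is subtracted."""
    return total_latency_cycles - dram_read_cycles


measured = 506  # typical value from the stride-4KB table (core clocks)
dram = 110.5    # estimated tRC in core cycles
fixed = fixed_cost_cycles(measured, dram)
print(f"fixed cost ~ {fixed:.1f} cycles ({fixed / measured:.0%} of the total)")
```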