Global memory access time Time to read from global to share memor

e.ping · July 15, 2007, 7:20pm

In the manual, it says it takes about 400-600 clock cycles to read a float from global to shared memory. Does that mean that if I want to read 16N bytes of data into 16 multiprocessors (blocks), it will take about 500*16N cycles? Or can the reads be issued one after another into different blocks (multiprocessors) with little delay between them?

If the memory read is coalesced, can we read 16 floats in one read instruction?

osiris1 · July 15, 2007, 11:58pm

See here for detailed measurements of global memory transfers.

In all the time I have been talking about this, no one has thought about the implication of the 400-600 cycles quoted in the manual. As I say, the real wall clock time is 950 clocks for the float read cycle time, fully warp coalesced and at 100% occupancy (on the 8800GTX that gets ~70GB/sec; it is different for each device type), but this is not the number that matters…

What is important is the thread clock time, which is the number of clock cycles available to a thread between global float accesses (or local memory register references). To get it, the wall clock time needs to be divided by 24 at 100% occupancy, giving only 40 clocks; it increases as occupancy drops, up to a certain point. This is a new column in the report above that I have not yet posted.

Eric

ed: fixed broken link
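
As a quick check of the thread clock time described above, here is a minimal Python sketch of the division. It assumes the "24" is the number of warps resident on one multiprocessor at 100% occupancy (the post does not spell this out), and the names `WALL_CLOCKS_PER_READ`, `WARPS_AT_FULL_OCCUPANCY` and `thread_clocks` are made up for illustration.

```python
# Figures quoted in the post above.
WALL_CLOCKS_PER_READ = 950      # measured cycle time of a coalesced float read
WARPS_AT_FULL_OCCUPANCY = 24    # assumption: resident warps per multiprocessor


def thread_clocks(wall_clocks, resident_warps):
    """Clocks left to each warp between its own global reads when the
    wall clock time is shared evenly among the resident warps."""
    return wall_clocks / resident_warps


full_occupancy = thread_clocks(WALL_CLOCKS_PER_READ, WARPS_AT_FULL_OCCUPANCY)
print(f"thread clock time at 100% occupancy: {full_occupancy:.1f} clocks")
```

This only reproduces the "about 40 clocks" figure. It says nothing about how the wall clock time itself changes when occupancy drops, which the post says was measured separately.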

e.ping · July 16, 2007, 1:36pm

Eric, thanks for the answer. I thought the 70GB/s was for device memory access, i.e., copying from one global memory position to another position. Is that right?

If there are 24*16 warps (24 per multiprocessor (MP), 16 MPs) working simultaneously, does it mean the access time is 960/(24*16) = 2.5 cycles?

What is the time delay between one MP issuing a read and another MP being able to issue a read?

I cannot access the web address for detailed measurements of global memory transfers.

sicb0161 · July 16, 2007, 5:58pm

Hi,

well I am not sure, but the internal memory data rate (transfer from global memory to registers) is 86.4 GB/s (900 MHz * 2 (double data rate) * 384 bits / 8 bits per byte).
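
The 86.4 GB/s figure follows directly from the clock, the double data rate and the bus width. A small Python sketch of that arithmetic (the constant and function names are illustrative, not from any library):

```python
# Numbers taken from the post; the names are hypothetical.
MEM_CLOCK_HZ = 900e6        # memory clock
TRANSFERS_PER_CLOCK = 2     # double data rate
BUS_WIDTH_BITS = 384        # memory bus width


def peak_bandwidth_bytes_per_s(clock_hz, transfers_per_clock, bus_bits):
    """Theoretical peak: transfers per second times bytes per transfer."""
    return clock_hz * transfers_per_clock * bus_bits / 8


peak = peak_bandwidth_bytes_per_s(MEM_CLOCK_HZ, TRANSFERS_PER_CLOCK, BUS_WIDTH_BITS)
print(f"peak bandwidth: {peak / 1e9:.1f} GB/s")
```

This is a theoretical upper bound; it does not account for refresh or other overheads.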

In contrast to the memory data rate, when memory is accessed there is always a time delay which is called memory latency. I think that is what the 400 - 600 cycles, i.e. ~0.4e-6 sec., is about.

So no matter how much data you want to transfer, you will not get around the 400 to 600 cycles. This is also exactly why you want to coalesce accesses, keep data in shared memory as long as possible, and write code with high arithmetic intensity.

Example: Transfer 4 B of data from global memory to shared memory
issuing : 4 cycles = ~ 3 ns
memory latency : 600 cycles = ~ 0.4 us
transferring data : ~ 0.046 ns <— :-( smaller than the latency by a factor of about 1e4

sums up to ~ 0.4 us !!! memory latency is dominating
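
The example above can be re-derived with a short Python sketch. It assumes a 1.35 GHz clock for converting cycles to time, which is what the post's own "4 cycles = ~3 ns" implies, and it uses the 86.4 GB/s peak data rate quoted earlier in the post. All names are illustrative.

```python
CLOCK_HZ = 1.35e9             # assumption: inferred from "4 cycles = ~3 ns"
PEAK_BYTES_PER_S = 86.4e9     # peak memory data rate quoted in the post
PAYLOAD_BYTES = 4             # one float


def cycles_to_seconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to seconds at the assumed clock."""
    return cycles / clock_hz


issue_s = cycles_to_seconds(4)                    # issuing the read
latency_s = cycles_to_seconds(600)                # memory latency (upper figure)
transfer_s = PAYLOAD_BYTES / PEAK_BYTES_PER_S     # moving the 4 bytes

total_s = issue_s + latency_s + transfer_s
ratio = latency_s / transfer_s                    # latency vs. transfer time

print(f"issue    : {issue_s * 1e9:.2f} ns")
print(f"latency  : {latency_s * 1e6:.2f} us")
print(f"transfer : {transfer_s * 1e9:.4f} ns")
print(f"latency share of total: {latency_s / total_s:.1%}")
```

With these inputs the latency term is several thousand times larger than the transfer term, which is the point of the example: for a single small read, latency is essentially the whole cost.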

osiris1 · July 16, 2007, 11:13pm

No, the ~70GB/s is device memory TO shared memory, sustainable, including all the hardware overheads like refresh, page changes and request/bus size mismatch. There is much confusion out there on the number in the bandwidth test…

No, the cycle time is 40 clocks of thread time at 100% occupancy, for fully coalesced 32-bit reads ONLY.

All the measurements in the above link are for all MPs running concurrently. Nothing is documented but you can bet it is a round robin system.

Sorry fixed the link above - browser did not delete the preselected initial data…

Eric
