In the manual, it says it takes about 400-600 clock cycles to read a float from global to shared memory. Does that mean that if I want to read 16N bytes of data into 16 multiprocessors (blocks), it will take about 500*16N cycles? Or can the reads be issued one after another to different blocks (multiprocessors) with little delay between them?
If the memory read is coalesced, can we read 16 floats in one read instruction?
See here for detailed measurements of global memory transfers.
In all the time I have been talking about this, no one has thought about the implication of the 400-600 cycles quoted in the manual. As I say, the real wall-clock time is 950 clocks per float read for fully warp-coalesced reads at 100% occupancy (on an 8800 GTX that gets ~70 GB/s; the figure differs by device type), but this is not the number that matters…
What is important is the thread clock time: the number of clock cycles available to a thread between global float accesses (or local-memory register references). That latency is shared among 24 warps at 100% occupancy, giving only ~40 clocks per thread, and the figure increases as occupancy drops, up to a certain point. This is a new column in the report above that I have not yet posted.
Eric, thanks for the answer. I thought the 70 GB/s was the device memory access rate, i.e., copying from one global memory position to another. Is that right?
If there are 24×16 warps working simultaneously (24 warps on each of 16 multiprocessors (MPs)), does that mean the access time is 960/(24×16) = 2.5 cycles?
What is the time delay between one MP issuing a read and another MP being able to issue one?
I cannot access the web address for detailed measurements of global memory transfers.
Well, I am not sure, but the internal memory data rate (transfer rate from global memory to registers) is 86.4 GB/s (900 MHz × 2 (DDR) × 384 bit).
In contrast to the memory data rate, every memory access incurs a time delay called memory latency. I think that is what the 400-600 cycles, i.e. ~0.4e-6 s, refers to.
So no matter how much data you want to transfer, you will not get around the 400 to 600 cycles of latency. This is also exactly why you want to coalesce accesses, keep data in shared memory as long as possible, and write code with high arithmetic intensity.
Example: transfer 4 B of data from global memory to shared memory:
issuing the instruction : 4 cycles = ~3 ns
memory latency : 600 cycles = ~0.4 us
transferring the data : ~0.0462 ns <— :-( smaller by a factor of ~1e4
Sums up to ~0.4 us !!! Memory latency is dominating.
No, the ~70 GB/s is device memory TO shared memory, sustained, including all the hardware overheads like refresh, page changes, and request/bus-size mismatch. There is much confusion out there about the number in the bandwidth test…
No, the cycle time is 40 clocks of thread time at 100% occupancy, and for fully coalesced 32-bit reads ONLY.
All the measurements in the above link are with all MPs running concurrently. Nothing is documented, but you can bet it is a round-robin system.
Sorry, fixed the link above - the browser did not delete the preselected initial data…