Latency of global memory access (in cycles)

Does anyone know what the latency of a global memory access is?
Using the clock() function, I measured around 900 cycles for the first memory read,
then around 700 cycles each for the second and third reads. The three reads
are more or less consecutive. The odd part is why the second and third reads take less time than the first.
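
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of one way to time three dependent reads with clock(). It is not the code behind the numbers above. The kernel name time_reads, the buffer sizes, and the stride are all hypothetical choices made for illustration. Each load depends on the previous one, and the loaded value is stored to shared memory before the second clock() call so that the timed interval has to include the load. The compiler is still free to reorder instructions, so the generated code is worth inspecting before trusting the numbers, and each figure also includes the overhead of the clock() calls and the shared-memory store.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_READS 3     // three consecutive reads, as in the post
#define CHAIN_LEN 1024  // hypothetical buffer length (in elements)
#define STRIDE    32    // hypothetical stride between reads (in elements)

// Single-thread pointer chase: the address of each read comes from the
// previous read, so the reads cannot overlap.
__global__ void time_reads(const unsigned int *chain,
                           unsigned int *cycles,
                           unsigned int *sink)
{
    __shared__ unsigned int seen[NUM_READS];
    __shared__ unsigned int elapsed[NUM_READS];
    unsigned int idx = 0;

    for (int i = 0; i < NUM_READS; ++i) {
        clock_t start = clock();
        idx = chain[idx];   // dependent global memory read
        seen[i] = idx;      // store needs the loaded value before it can issue
        clock_t stop = clock();
        elapsed[i] = (unsigned int)(stop - start);
    }

    // Write results to global memory only after all timing is done.
    for (int i = 0; i < NUM_READS; ++i) {
        cycles[i] = elapsed[i];
    }
    *sink = seen[NUM_READS - 1];  // keep the loads from being optimized away
}

int main()
{
    // Build the chain: element i holds the index of the next element to read.
    unsigned int h_chain[CHAIN_LEN];
    for (unsigned int i = 0; i < CHAIN_LEN; ++i) {
        h_chain[i] = (i + STRIDE) % CHAIN_LEN;
    }

    unsigned int *d_chain, *d_cycles, *d_sink;
    cudaMalloc((void **)&d_chain, sizeof(h_chain));
    cudaMalloc((void **)&d_cycles, NUM_READS * sizeof(unsigned int));
    cudaMalloc((void **)&d_sink, sizeof(unsigned int));
    cudaMemcpy(d_chain, h_chain, sizeof(h_chain), cudaMemcpyHostToDevice);

    time_reads<<<1, 1>>>(d_chain, d_cycles, d_sink);

    // This copy waits for the kernel and reports any launch or run error.
    unsigned int h_cycles[NUM_READS];
    cudaError_t err = cudaMemcpy(h_cycles, d_cycles, sizeof(h_cycles),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < NUM_READS; ++i) {
        printf("read %d: %u cycles\n", i, h_cycles[i]);
    }

    cudaFree(d_chain);
    cudaFree(d_cycles);
    cudaFree(d_sink);
    return 0;
}
```

Running the kernel with a single thread keeps other warps from hiding or adding to the latency. Comparing the first iteration against the later ones should show whether the extra cost of the first read is reproducible.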

The programming guide (section 5.1.1.3, Memory Instructions) gives a latency of about 400 to 600 cycles. On my machine (8800 GTX, 1.35 GHz shader clock, 900 MHz memory clock), and assuming clock() counts shader-clock cycles, 900 cycles correspond to 600 memory-clock cycles and 700 cycles to about 467 memory-clock cycles.
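
Written out, that conversion is just a scaling by the ratio of the two clocks. It uses the 1.35 GHz and 900 MHz figures quoted for this card, and it rests on the assumption that clock() ticks at the shader clock:

```latex
\[
N_{\text{mem}}
  = N_{\text{shader}} \cdot \frac{f_{\text{mem}}}{f_{\text{shader}}}
  = N_{\text{shader}} \cdot \frac{900\ \text{MHz}}{1350\ \text{MHz}}
  = \tfrac{2}{3}\, N_{\text{shader}}
\]
\[
900 \cdot \tfrac{2}{3} = 600,
\qquad
700 \cdot \tfrac{2}{3} \approx 467
\]
```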

Any comments?