Well, I am not sure, but the internal memory data rate (the peak bandwidth from global memory to registers) is 86.4 GB/s (900 MHz * 2 (double data rate) * 384 bit / 8 bit per byte).
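To double-check that figure, here is a quick sketch of the arithmetic in Python. The variable names are my own, and the only inputs are the three numbers quoted above; this is a peak value, not something you would measure in practice.

```python
# Peak memory data rate from the figures quoted in the post.
memory_clock_hz = 900e6      # 900 MHz memory clock
transfers_per_clock = 2      # double data rate: two transfers per clock
bus_width_bits = 384         # width of the memory interface

# bits per second -> bytes per second -> GB/s (1 GB = 1e9 B)
bandwidth_bytes_per_s = memory_clock_hz * transfers_per_clock * bus_width_bits / 8
bandwidth_gb_per_s = bandwidth_bytes_per_s / 1e9

print(f"{bandwidth_gb_per_s:.1f} GB/s")
```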
In contrast to the memory data rate, every memory access also incurs a fixed delay, which is called memory latency. I think that is what the 400 - 600 cycles is about, i.e. ~0.4e-6 sec. if you take a clock of roughly 1.35 GHz (which is what the numbers below assume).
So no matter how much data you want to transfer, you will not get around the 400 to 600 cycles. This is exactly why you want to coalesce accesses, keep data in shared memory as long as possible, and write code with high arithmetic intensity.
Example: Transfer 4 B of data from global memory to shared memory
issuing           : 4 cycles   = ~ 3 ns
memory latency    : 600 cycles = ~ 0.4 us
transferring data : ~ 0.0463 ns <- :-( smaller than the latency by a factor of about 1e4
sums up to ~ 0.4 us !!! memory latency is dominating
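The same estimate can be redone for other transfer sizes with a small Python sketch. Note the assumptions: the 1.35 GHz clock is only inferred from "4 cycles = ~ 3 ns", the helper `access_time` and the constant names are made up for illustration, and the model treats a single access in isolation, so it ignores that many outstanding requests from other threads can overlap and hide the latency.

```python
# Rough model of one isolated global memory access (illustrative only).
CLOCK_HZ = 1.35e9            # assumed clock, inferred from "4 cycles = ~ 3 ns"
BANDWIDTH_B_PER_S = 86.4e9   # peak data rate quoted in the post
ISSUE_CYCLES = 4             # cycles to issue the access
LATENCY_CYCLES = 600         # upper end of the 400 - 600 cycle range


def access_time(n_bytes):
    """Return (issue, latency, transfer) times in seconds for n_bytes."""
    issue = ISSUE_CYCLES / CLOCK_HZ
    latency = LATENCY_CYCLES / CLOCK_HZ
    transfer = n_bytes / BANDWIDTH_B_PER_S
    return issue, latency, transfer


for n_bytes in (4, 64, 4096, 65536):
    issue, latency, transfer = access_time(n_bytes)
    total = issue + latency + transfer
    print(f"{n_bytes:6d} B: transfer {transfer * 1e9:9.4f} ns, "
          f"total {total * 1e9:8.1f} ns, latency share {latency / total:6.1%}")

# Transfer size at which the pure transfer time equals the latency.
break_even_bytes = LATENCY_CYCLES / CLOCK_HZ * BANDWIDTH_B_PER_S
print(f"break-even at about {break_even_bytes:.0f} B")
```

With these numbers the transfer time only catches up with the latency at a few tens of kB per access, which is another way of saying that small, scattered reads are almost pure latency.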