Suppose I am writing CUDA code in which the only work is a global memory fetch. If one half-warp takes 600 cycles for a memory fetch, how long will 8 half-warps take (my thread block contains 8 half-warps, i.e. 128 threads)? Will it be equal to 600*8, or (600 + some delta time) because the memory pipeline helps in some way? I want to know how effective the memory pipeline is.
The memory pipeline is effective enough to practically max out the available memory bandwidth, attaining ~70 GiB/s on an 8800 GTX.
I think you misinterpreted my question. My question is very specific, and you gave a very general answer to it. Thanks anyway…
The pipeline on the GPU is optimized to overlap computation and memory reads, which you are probably aware of, so I won’t bore you with the details.
In the best case, the pipeline is extremely effective. Each warp submits its memory requests to the memory controller and can then continue executing instructions that don’t depend on the results of those reads.
So, the total latency is roughly 600 + (delta time) * num_warps_on_multiproc. However, the memory controller can only supply so much bandwidth from global memory, so additional requests may have to wait longer. The time they must wait varies with the number of memory requests currently active in the controller, which depends on the order of execution of all previous instructions in all warps (which is undetermined)… Hence my original answer: the memory system can provide nearly the full available bandwidth. The actual latency of any particular memory access cannot be determined.
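To put rough numbers on the two cases from the question, here is a minimal C++ sketch of the arithmetic only. It is not a measurement: `serializedCycles`, `pipelinedCycles`, and the `delta` value of 16 cycles are hypothetical names and numbers chosen for illustration, and the real delta depends on how busy the memory controller is.

```cpp
// Fully serialized case: each half-warp waits for the previous fetch to finish,
// so the total is latency * n (the "600*8" guess from the question).
constexpr long long serializedCycles(long long latency, int halfWarps) {
    return latency * halfWarps;
}

// Pipelined case, using the formula from the post above:
// one full latency plus a per-half-warp issue delta (delta is an assumed value).
constexpr long long pipelinedCycles(long long latency, long long delta, int halfWarps) {
    return latency + delta * halfWarps;
}

// Compile-time checks with the numbers from the question and a hypothetical delta of 16.
static_assert(serializedCycles(600, 8) == 4800, "8 half-warps, no overlap");
static_assert(pipelinedCycles(600, 16, 8) == 728, "8 half-warps, overlapped");
```

With these assumed numbers, the overlapped case costs 728 cycles instead of 4800. The gap shrinks as delta grows, which is what happens once the memory controller's bandwidth is saturated.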
thanks…that was close to my question…