Latency hiding: how much speedup can you get?

Hi all,

Global memory access is often the bottleneck of a CUDA implementation. Therefore, latency hiding is used: while thread 1 waits for a global memory access, other threads become active and perform arithmetic computations. If there is a lot of arithmetic to do relative to the global memory accesses, the arithmetic dominates and the memory access time can be ignored. Otherwise, the memory access time dominates and the arithmetic time can be ignored. In either case, the speedup from latency hiding is at most a factor of 2 compared to no latency hiding: with perfect overlap, the total time drops from the sum of the two components to the maximum of the two, and the sum is never more than twice the maximum. At least, that’s my theory. Could this be correct? Or does someone have experience with greater speedups?
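In symbols (my own formalization of the argument above; T_arith and T_mem denote the two time components):

```latex
% Factor-of-2 bound, assuming perfect overlap:
%   without hiding: T_serial  = T_arith + T_mem
%   with hiding:    T_overlap = max(T_arith, T_mem)
\[
  \text{speedup}
    = \frac{T_{\mathrm{arith}} + T_{\mathrm{mem}}}
           {\max(T_{\mathrm{arith}}, T_{\mathrm{mem}})}
    \le 2,
\]
% with equality exactly when the two components are equal.
```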

Other threads (or even the same one) do not necessarily have to perform arithmetic instructions to hide the latency. You achieve pipelining (and high utilization of the GPU) as long as the hardware has instructions to issue. So, if there are independent read instructions available for issue, they will contribute to latency hiding as well. That’s where higher occupancy helps: more threads from which to issue instructions.
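For example, in this minimal sketch of my own (not from the thread), the two loads per thread are independent, so both can be in flight at the same time; only the add has to wait for them:

```cuda
// Sketch: two independent global loads per thread. The hardware can
// issue the load of b[i] without waiting for a[i] to return, so the
// two memory latencies overlap; only the add depends on both results.
__global__ void independentReads(const float *a, const float *b,
                                 float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i];   // first load issued
        float y = b[i];   // independent load issued right after
        out[i] = x + y;   // first instruction that must wait for both
    }
}
```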

As far as speedups go: given a kernel that only writes global memory (i.e. no arithmetic instructions to schedule after the write), going from 4.2% occupancy to 100% showed a speedup of nearly 7x. It’s worth mentioning that for this particular test, occupancies of 50% and higher showed essentially the same performance.

Paulius

Hi Paulius,

Thanks for your reply; this is interesting information. Do you perhaps have a link to this test, or something similar?

I don’t seem to have kept the code. However, it’s easy to recreate: the kernel computes the index into gmem and writes a constant value to that location (a pointer to the gmem location is the kernel’s only parameter).
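A minimal reconstruction might look like this (the kernel and parameter names are my guesses; the original only specifies a single gmem pointer parameter):

```cuda
// Hypothetical reconstruction of the test kernel: compute the global
// index and store a constant. No instructions follow the store, so
// latency can only be hidden by issuing stores from other warps.
__global__ void writeOnly(float *gmem)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    gmem[i] = 1.0f;
}
```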

In your experiments, vary occupancy by adjusting the smem requirement and the block size, keeping the total number of threads constant. For a given block size, you can adjust occupancy by varying the number of smem bytes requested at kernel launch, as sketched below. You’ll have to go with smaller blocks to achieve the lowest occupancies (don’t bother with blocks smaller than 32 threads, i.e. one warp).
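On the host side, the dynamic shared-memory argument of the launch configuration is one way to do this. A sketch, with placeholder sizes, assuming d_gmem is a device allocation of at least totalThreads floats:

```cuda
// Sketch: throttle occupancy via the dynamic smem size in the launch
// config. The kernel need not actually use the shared memory; the
// per-block allocation alone limits how many blocks fit on an SM.
int    totalThreads = 1 << 20;       // keep constant across runs
int    blockSize    = 128;           // shrink for the lowest occupancies
int    gridSize     = totalThreads / blockSize;
size_t smemBytes    = 8 * 1024;      // raise to reduce occupancy

writeOnly<<<gridSize, blockSize, smemBytes>>>(d_gmem);
```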

Paulius