Latency hiding: how much speedup can you get?

Hi all,

Global memory access is often the bottleneck of a CUDA implementation. Therefore, latency hiding is used: while thread 1 waits for a global memory access, other threads become active and perform arithmetic computations. If there is a lot of arithmetic to do relative to the global memory accesses, the arithmetic dominates and the memory access time can be ignored. Otherwise, the memory access time dominates and the arithmetic time can be ignored. In either case, the speedup from latency hiding is at most a factor of 2 compared to no latency hiding: with perfect overlap, the total time drops from the sum of the two components to the maximum of the two, and the sum is never more than twice the maximum. At least, that’s my theory. Could this be correct? Or does someone have experience with greater speedups?
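In symbols (my own formalization of the argument above; T_arith and T_mem denote the two time components):

```latex
% Factor-of-2 bound, assuming perfect overlap:
%   without hiding: T_serial  = T_arith + T_mem
%   with hiding:    T_overlap = max(T_arith, T_mem)
\[
  \text{speedup}
    = \frac{T_{\mathrm{arith}} + T_{\mathrm{mem}}}
           {\max(T_{\mathrm{arith}}, T_{\mathrm{mem}})}
    \le 2,
\]
% with equality exactly when the two components are equal.
```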

Other threads (or even the same one) do not necessarily have to perform arithmetic instructions to hide the latency. You achieve pipelining (and high utilization of the GPU) as long as the hardware has instructions to issue. So, if there are independent read instructions available for issue, they will contribute to latency hiding as well. That’s where higher occupancy helps: more threads from which to issue instructions.
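For example, in this minimal sketch of my own (not from the thread), the two loads per thread are independent, so both can be in flight at the same time; only the add has to wait for them:

```cuda
// Sketch: two independent global loads per thread. The hardware can
// issue the load of b[i] without waiting for a[i] to return, so the
// two memory latencies overlap; only the add depends on both results.
__global__ void independentReads(const float *a, const float *b,
                                 float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i];   // first load issued
        float y = b[i];   // independent load issued right after
        out[i] = x + y;   // first instruction that must wait for both
    }
}
```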

As far as speedups go: given a kernel that only writes global memory (i.e. no arithmetic instructions to schedule after the write), going from 4.2% occupancy to 100% showed a speedup of nearly 7x. It’s worth mentioning that for this particular test, occupancies of 50% and higher showed essentially the same performance.

Paulius

Hi Paulius,

Thanks for your reply; this is interesting information. Do you perhaps have a link to this test, or something similar?

I don’t seem to have kept the code. However, it’s easy to recreate: the kernel computes the index into gmem and writes a constant value to that location (a pointer to the gmem location is the kernel’s only parameter).
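A minimal reconstruction might look like this (the kernel and parameter names are my guesses; the original only specifies a single gmem pointer parameter):

```cuda
// Hypothetical reconstruction of the test kernel: compute the global
// index and store a constant. No instructions follow the store, so
// latency can only be hidden by issuing stores from other warps.
__global__ void writeOnly(float *gmem)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    gmem[i] = 1.0f;
}
```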

In your experiments, vary occupancy by adjusting the smem requirement and the block size, keeping the total number of threads constant. For a given block size, you can adjust occupancy by varying the number of smem bytes requested at kernel launch, as sketched below. You’ll have to go with smaller blocks to achieve the lowest occupancies (don’t bother with blocks smaller than 32 threads, i.e. one warp).
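On the host side, the dynamic shared-memory argument of the launch configuration is one way to do this. A sketch, with placeholder sizes, assuming d_gmem is a device allocation of at least totalThreads floats:

```cuda
// Sketch: throttle occupancy via the dynamic smem size in the launch
// config. The kernel need not actually use the shared memory; the
// per-block allocation alone limits how many blocks fit on an SM.
int    totalThreads = 1 << 20;       // keep constant across runs
int    blockSize    = 128;           // shrink for the lowest occupancies
int    gridSize     = totalThreads / blockSize;
size_t smemBytes    = 8 * 1024;      // raise to reduce occupancy

writeOnly<<<gridSize, blockSize, smemBytes>>>(d_gmem);
```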

Paulius