Boosting Application Performance with GPU Memory Prefetching

Originally published at:

This CUDA post examines the effectiveness of methods to hide memory latency using explicit prefetching.

The work described in this post is derived from a real application in computational finance. Please feel free to ask questions about any details that may be unclear.

Hello, how do you normally decide on the value of PDIST for a given application in order to hide the memory latency?

Thanks for the question. It is difficult to derive an analytical expression for the proper value of PDIST because it depends on, among other things, the memory latency and the occupancy of the Streaming Multiprocessors (SMs). Occupancy, in turn, is a function of the number of registers used per thread and the total amount of shared memory used by the kernel. The easiest strategy is to vary PDIST until optimal performance is achieved. A slightly more focused approach is to determine how much shared memory there is to spare, using the occupancy view in Nsight Compute, and to choose PDIST such that all of it is used for the prefetch buffer. This is not foolproof, however, because it sometimes helps to reduce the number of thread blocks per SM somewhat to free up more shared memory.
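To make the shared memory budgeting concrete, here is a minimal host-side sketch of the arithmetic. It rests on assumptions that are not stated in this thread: the prefetch buffer is assumed to hold PDIST elements per thread, and the helper names `spare_smem_per_block` and `max_pdist` are hypothetical. The real inputs should be read from the occupancy view in Nsight Compute for your kernel and GPU.

```cpp
#include <cstddef>

// Hypothetical helper: shared memory each block can still claim if
// blocks_per_sm blocks are to stay resident on one SM.
// Assumption: shared memory per SM is divided evenly among resident blocks.
std::size_t spare_smem_per_block(std::size_t smem_per_sm_bytes,
                                 std::size_t blocks_per_sm,
                                 std::size_t smem_used_per_block_bytes) {
    if (blocks_per_sm == 0) return 0;
    const std::size_t share = smem_per_sm_bytes / blocks_per_sm;
    return share > smem_used_per_block_bytes
               ? share - smem_used_per_block_bytes
               : 0;
}

// Hypothetical helper: largest PDIST whose prefetch buffer fits in the
// spare shared memory of one block.
// Assumption: the buffer occupies pdist * threads_per_block * elem_bytes.
std::size_t max_pdist(std::size_t spare_bytes_per_block,
                      std::size_t threads_per_block,
                      std::size_t elem_bytes) {
    const std::size_t bytes_per_step = threads_per_block * elem_bytes;
    return bytes_per_step == 0 ? 0 : spare_bytes_per_block / bytes_per_step;
}
```

With purely illustrative numbers (64 KiB of shared memory per SM, 128 threads per block, 8-byte elements, no other shared memory use), keeping 8 blocks resident per SM leaves room for a PDIST of 8, while dropping to 4 blocks per SM leaves room for 16. That is the trade-off mentioned above, so the value computed this way is best treated as a starting point for the sweep over PDIST, not as the final answer.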
