Boosting Application Performance with GPU Memory Prefetching

Originally published at: https://developer.nvidia.com/blog/boosting-application-performance-with-gpu-memory-prefetching/

This CUDA post examines the effectiveness of methods to hide memory latency using explicit prefetching.

The work described in this post is derived from a real application in computational finance. Please feel free to ask questions about any details that may be unclear.

Hello, how does one normally decide on PDIST per application to hide the memory latency?

Thanks for the question. It is difficult to derive an analytical expression for the proper value of PDIST, because it depends, among other things, on the occupancy of the Streaming Multiprocessors (SMs), which in turn is a function of the number of registers used per thread and the total amount of shared memory used by the kernel, as well as the memory latency. The easiest strategy is to vary PDIST until optimal performance is achieved. A slightly more focused approach is to compute how much shared memory there is to spare, using the occupancy view in Nsight Compute, and to choose PDIST such that it is all used for the prefetch buffer. But this is not foolproof, because sometimes it helps to reduce the number of thread blocks per SM somewhat to free up more shared memory.
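For readers who want something concrete to sweep, here is a minimal, self-contained sketch (not the kernel from the blog post) of batched prefetching with a compile-time prefetch distance PDIST. The kernel name, array names, and the per-element work are placeholders; the point is that PDIST is a knob you can rebuild with (e.g. -DPDIST=2,4,8,...) while watching runtime and occupancy in Nsight Compute. A shared-memory variant would instead size its prefetch buffer as PDIST * blockDim.x doubles, which is the shared-memory budget referred to above.

```
#include <cstdio>
#include <cuda_runtime.h>

#ifndef PDIST
#define PDIST 8   // prefetch distance: elements fetched ahead per thread
#endif

__global__ void prefetch_kernel(const double* __restrict__ arr,
                                double* __restrict__ out, int n)
{
    double buf[PDIST];                       // per-thread prefetch buffer
    int stride = blockDim.x * gridDim.x;
    int start  = blockIdx.x * blockDim.x + threadIdx.x;

    // Process the input in batches of PDIST elements per thread:
    // issue all PDIST loads first, then consume them, so the loads
    // overlap in flight instead of serializing on memory latency.
    for (int base = start; base < n; base += PDIST * stride) {
        for (int k = 0; k < PDIST; ++k) {        // prefetch phase
            int i = base + k * stride;
            buf[k] = (i < n) ? arr[i] : 0.0;
        }
        for (int k = 0; k < PDIST; ++k) {        // compute phase
            int i = base + k * stride;
            if (i < n) out[i] = 2.0 * buf[k];    // placeholder work
        }
    }
}

int main()
{
    const int n = 1 << 20;
    double *arr, *out;
    cudaMallocManaged(&arr, n * sizeof(double));
    cudaMallocManaged(&out, n * sizeof(double));
    for (int i = 0; i < n; ++i) arr[i] = double(i);

    prefetch_kernel<<<256, 256>>>(arr, out, n);
    cudaDeviceSynchronize();
    printf("out[1] = %f (expect 2.0)\n", out[1]);

    cudaFree(arr);
    cudaFree(out);
    return 0;
}
```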


Hello, the shared memory padding strategy is not economical in some circumstances. Would #define vsmem(index) v[threadIdx.x + PDIST*index] work better for this post?
Besides, according to the CUDA Programming Guide, for Compute Capability 5.x and later, shared memory has 32 banks of 32-bit words. So is there no way to make conflict-free reads for the double type?

Yes, it would work better. As I wrote in the blog: “We could actually have arrived at this performance improvement without resorting to padding by changing the indexing scheme of the array in shared memory, which is left as an exercise for the reader.” You did the exercise!
It is indeed impossible to avoid bank conflicts with 64-bit words, but the point is that the indexing you proposed minimizes them.

The indexing into v should be threadIdx.x + blockDim.x*index, right? Each thread essentially gets its own column (which would equate to a bank for 32-bit words).

Yes, you are right; I was too quick in responding to liuws’s suggestion. Thank you for pointing out my error.
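To make the corrected indexing concrete, here is a short illustrative kernel fragment, assuming a shared-memory prefetch buffer of PDIST * blockDim.x doubles and dynamic shared memory sized accordingly at launch; the macro name vsmem follows the question above, and everything else is a placeholder.

```
#define PDIST 8

// With this layout, consecutive threads touch consecutive buffer slots for a
// given index, so for 32-bit words each lane stays in its own bank; for
// 64-bit doubles the conflicts cannot be removed entirely, but this
// indexing minimizes them.
#define vsmem(index) v[threadIdx.x + blockDim.x * (index)]

__global__ void kernel_fragment(const double* __restrict__ arr, int n)
{
    // Launch with PDIST * blockDim.x * sizeof(double) bytes of dynamic shared memory.
    extern __shared__ double v[];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < PDIST; ++k) {
        int j = i + k * blockDim.x * gridDim.x;
        vsmem(k) = (j < n) ? arr[j] : 0.0;   // prefetch into this thread's "column"
    }
    // ... consume vsmem(0) .. vsmem(PDIST-1) here ...
}
```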