Boosting Application Performance with GPU Memory Prefetching

Originally published at: https://developer.nvidia.com/blog/boosting-application-performance-with-gpu-memory-prefetching/

This CUDA post examines the effectiveness of methods to hide memory latency using explicit prefetching.

The work described in this post is derived from a real application in computational finance. Please feel free to ask questions about any details that may be unclear.

Hello, how does one normally decide on PDIST per application to hide the memory latency?

Thanks for the question. It is difficult to derive an analytical expression for the proper value of PDIST, because it depends, among other things, on the occupancy of the Streaming Multiprocessors (SMs), which in turn is a function of the number of registers used per thread and the total amount of shared memory used by the kernel, as well as the memory latency. The easiest strategy is to vary PDIST until optimal performance is achieved. A slightly more focused approach is to compute how much shared memory there is to spare, using the occupancy view in Nsight Compute, and to choose PDIST such that it is all used for the prefetch buffer. But this is not foolproof, because sometimes it helps to reduce the number of thread blocks per SM somewhat to free up more shared memory.
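For readers who want something concrete to sweep, here is a minimal, self-contained sketch (not the kernel from the blog post) of batched prefetching with a compile-time prefetch distance PDIST. The kernel name, array names, and the per-element work are placeholders; the point is that PDIST is a knob you can rebuild with (e.g. -DPDIST=2,4,8,...) while watching runtime and occupancy in Nsight Compute. A shared-memory variant would instead size its prefetch buffer as PDIST * blockDim.x doubles, which is the shared-memory budget referred to above.

```
#include <cstdio>
#include <cuda_runtime.h>

#ifndef PDIST
#define PDIST 8   // prefetch distance: elements fetched ahead per thread
#endif

__global__ void prefetch_kernel(const double* __restrict__ arr,
                                double* __restrict__ out, int n)
{
    double buf[PDIST];                       // per-thread prefetch buffer
    int stride = blockDim.x * gridDim.x;
    int start  = blockIdx.x * blockDim.x + threadIdx.x;

    // Process the input in batches of PDIST elements per thread:
    // issue all PDIST loads first, then consume them, so the loads
    // overlap in flight instead of serializing on memory latency.
    for (int base = start; base < n; base += PDIST * stride) {
        for (int k = 0; k < PDIST; ++k) {        // prefetch phase
            int i = base + k * stride;
            buf[k] = (i < n) ? arr[i] : 0.0;
        }
        for (int k = 0; k < PDIST; ++k) {        // compute phase
            int i = base + k * stride;
            if (i < n) out[i] = 2.0 * buf[k];    // placeholder work
        }
    }
}

int main()
{
    const int n = 1 << 20;
    double *arr, *out;
    cudaMallocManaged(&arr, n * sizeof(double));
    cudaMallocManaged(&out, n * sizeof(double));
    for (int i = 0; i < n; ++i) arr[i] = double(i);

    prefetch_kernel<<<256, 256>>>(arr, out, n);
    cudaDeviceSynchronize();
    printf("out[1] = %f (expect 2.0)\n", out[1]);

    cudaFree(arr);
    cudaFree(out);
    return 0;
}
```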


Hello, the shared memory padding strategy is not economical in some circumstances. Would #define vsmem(index) v[threadIdx.x + PDIST*index] work better for this post?
Besides, according to the CUDA Programming Guide, for Compute Capability 5.x and later, shared memory has 32 banks of 32-bit words. So is there no way to make conflict-free reads for the double type?

Yes, it would work better. As I wrote in the blog: “We could actually have arrived at this performance improvement without resorting to padding by changing the indexing scheme of the array in shared memory, which is left as an exercise for the reader.” You did the exercise!
It is indeed impossible to avoid bank conflicts with 64-bit words, but the point is that the indexing you proposed minimizes them.

The indexing into v should be threadIdx.x + blockDim.x*index, right? Each thread essentially gets its own column (which would equate to a bank for 32-bit words).

Yes, you are right; I was too quick in responding to liuws’s suggestion. Thank you for pointing out my error.
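To make the corrected indexing concrete, here is a short illustrative kernel fragment, assuming a shared-memory prefetch buffer of PDIST * blockDim.x doubles and dynamic shared memory sized accordingly at launch; the macro name vsmem follows the question above, and everything else is a placeholder.

```
#define PDIST 8

// With this layout, consecutive threads touch consecutive buffer slots for a
// given index, so for 32-bit words each lane stays in its own bank; for
// 64-bit doubles the conflicts cannot be removed entirely, but this
// indexing minimizes them.
#define vsmem(index) v[threadIdx.x + blockDim.x * (index)]

__global__ void kernel_fragment(const double* __restrict__ arr, int n)
{
    // Launch with PDIST * blockDim.x * sizeof(double) bytes of dynamic shared memory.
    extern __shared__ double v[];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < PDIST; ++k) {
        int j = i + k * blockDim.x * gridDim.x;
        vsmem(k) = (j < n) ? arr[j] : 0.0;   // prefetch into this thread's "column"
    }
    // ... consume vsmem(0) .. vsmem(PDIST-1) here ...
}
```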