Boosting Application Performance with GPU Memory Access Tuning

Originally published at:

In this post, we examine a method programmers can use to saturate memory bandwidth on a GPU.

This blog saves the best part for the end. Users interested in tuning the performance of their CUDA kernels should always try launch bounds first. In this particular case study, that alone would have been enough. But not always, of course.
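For readers who haven't used launch bounds before, here is a minimal sketch of the `__launch_bounds__` qualifier; the kernel name, body, and bound values are illustrative, not taken from the blog:

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells
// nvcc the largest block size the kernel will be launched with (and,
// optionally, how many blocks per SM you want resident), so the compiler
// can cap per-thread register usage to hit that occupancy.
__global__ void __launch_bounds__(128, 4)
scale(float *out, const float *in, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i];
}

// The launch configuration must respect the bound (block size <= 128 here):
// scale<<<(n + 127) / 128, 128>>>(d_out, d_in, 2.0f, n);
```

Launching such a kernel with a block larger than the declared maximum is an error, so only declare a bound you can guarantee at every launch site.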

Great walkthrough, thank you! Can you please elaborate on why the duration of all kernel variants suddenly dropped to the same value (~10 ns) starting from around the 20th kernel launch (Figure 1)? Does that mean that for an application with thousands of kernel launches all the described optimizations are actually useless?

Good question. The plot of kernel durations shown actually repeats itself in the application from which it was derived. Figure 1 shows one full period of that repetition. After the kernel duration drops, it jumps back up again, and substantial overall application performance improvement is obtained from the optimizations described in the blog.

Thank you, very intriguing! Is there an explanation for such a huge variation (3x-5x!) of kernel durations?

Thanks for the awesome information.

Mark, the drop is due to a smaller data set being processed by the kernel. But, as I said, it’s a cyclic process, and the size increases again periodically.

Thank you, robv, now it is all clear! I will give it a go for sure.

Mon, Aug 8, 2022, 18:45, Robv via NVIDIA Developer Forums <>:

Very good, Mark. Please let me know if you have more questions.

I think there’s a typo in the third paragraph of the launch bounds section: “each thread can use up to 64 threads”. I think that last word should be “registers”.

You are correct again, thank you. We’ll fix the typo.

Fixed! Thanks, @dwatersg!


Thanks, my issue has been fixed.