Boosting Application Performance with GPU Memory Access Tuning

Originally published at: https://developer.nvidia.com/blog/boosting-application-performance-with-gpu-memory-access-tuning/

In this post, we examine a method programmers can use to saturate memory bandwidth on a GPU.

This blog saves the best part for the end. Users interested in tuning the performance of their CUDA kernels should always try launch bounds first. In this particular case study that alone would have been enough, but that is not always the case, of course.
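For anyone who wants to try that first, here is a minimal sketch of what a launch-bounds annotation looks like. The kernel, the bound of 256 threads per block, and the grid-stride loop are illustrative assumptions, not the kernel from the case study:

```cpp
#include <cuda_runtime.h>

// __launch_bounds__(256) promises the compiler that this kernel is never
// launched with more than 256 threads per block, which lets it budget
// registers per thread more aggressively.
__global__ void __launch_bounds__(256)
scale(float* __restrict__ out, const float* __restrict__ in, float a, int n)
{
    // Grid-stride loop so any grid size covers the whole array.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        out[i] = a * in[i];
}

int main()
{
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    // The block size used at launch must not exceed the bound declared above.
    scale<<<(n + 255) / 256, 256>>>(out, in, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```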

Great walkthrough, thank you! Can you please elaborate on why the duration of all kernel variants suddenly dropped to the same value (~10 ns) starting from about the 20th kernel launch (Figure 1)? Does that mean that for an application with thousands of kernel launches all the described optimizations are actually useless?

Good question. The plot of kernel durations shown actually repeats itself in the application from which it was derived. Figure 1 shows one full period of that repetition. After the kernel duration drops, it jumps back up again, and a substantial overall improvement in application performance is obtained from the optimizations described in the blog.

Thank you, very intriguing! Is there an explanation for such a huge variation (3x-5x!) in kernel durations?

Thanks for the awesome information.

Mark, the drop is due to a smaller data set being processed by the kernel. But, as I said, it’s a cyclic process, and the size increases again periodically.

Thank you, robv, now it is all clear! I will give it a go for sure.


Very good, Mark. Please let me know if you have more questions.

I think there’s a typo in the third paragraph of the launch bounds section: “each thread can use up to 64 threads”. I think that last word should be “registers”.

You are correct again, thank you. We’ll fix the typo.
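For anyone curious how many registers a kernel actually ends up with once launch bounds are applied, a minimal sketch is below. The placeholder kernel, the bound of 1024 threads per block, and the 64K-registers-per-SM arithmetic in the comment are my own assumptions, not taken from the blog:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel. With a bound of 1024 threads per block and a register
// file of 65536 registers per SM, the per-thread budget works out to
// 65536 / 1024 = 64 registers.
__global__ void __launch_bounds__(1024) touch(float* p)
{
    p[threadIdx.x] += 1.0f;
}

int main()
{
    // numRegs is the number of registers the compiler assigned to each
    // thread of this kernel; `nvcc -Xptxas -v` reports the same figure
    // at compile time.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, touch);
    printf("registers per thread: %d\n", attr.numRegs);
    return 0;
}
```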

Fixed! Thanks, @dwatersg!


Thanks, my issue has been fixed.