GTC 2020: What the Profiler is Telling You: How to Get the Most Performance out of Your Hardware

GTC 2020 S22141
Presenters: Markus Hrywniak, NVIDIA; Milos Maric, NVIDIA
We’ll explore how to analyze and optimize the performance of GPU-accelerated applications. Working with a real-world example, we’ll start by identifying high-level bottlenecks, then walk through an analysis-driven process leading to a series of kernel-level optimizations. Using NVIDIA’s Nsight Systems and Nsight Compute profiling tools as an example, you’ll learn about the fundamental performance limiters: instruction throughput, memory throughput, and latency. We’ll present strategies to identify and tackle each type of limiter.


Hi Markus, thanks for the talk, I learned a lot and the new tools look very impressive.

I have a more general question regarding profiling and optimization. To what extent can the developer control how work in different streams actually overlaps? In other words, how much of the kernel overlap is determined by the developer versus the scheduler?

I have a multi-GPU code and I use streams to overlap computation with communication where possible. I have found that if I write equivalent functions with different ordering (although the inter-stream dependencies are the same), I get slightly different overlap and different performance (as indicated by NVVP). For example, one version of the code completely overlaps the communication, while in others the communication runs slightly past the kernel it was overlapping. The timings differ slightly, though they stay within 1-5% of each other. For multi-hour runs, though, 1-5% can be a lot of compute time.

Since there are many possible variations of the same code (e.g. I can use 2 streams to indicate dependence or more streams combined with events) what is the best approach to guiding the scheduler to produce the best timeline possible? If possible, are there any literature sources I can read on this topic?


Hi Todd, glad you liked the talk. I think you already listed the main options you have. One other thing you could try, if it makes sense for your application, is setting stream priorities.
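As a rough illustration of the stream priority suggestion, here is a minimal sketch using the CUDA runtime calls `cudaDeviceGetStreamPriorityRange` and `cudaStreamCreateWithPriority`. The kernel `scale`, the function `run_priority_demo`, and the buffer names are hypothetical placeholders, and giving the communication stream the higher priority is an assumption about what suits a code like the one described. Priorities are only a hint for which pending work is picked up first; they do not guarantee a particular overlap.

```cuda
#include <cuda_runtime.h>

// Leave the enclosing function with 1 on any CUDA error.
#define CHECK(call) do { if ((call) != cudaSuccess) return 1; } while (0)

// Hypothetical placeholder kernel standing in for real work.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns 0 on success, 1 on failure.
int run_priority_demo() {
    // Numerically lower values mean higher priority, so `greatest`
    // is less than or equal to `least`.
    int least = 0, greatest = 0;
    CHECK(cudaDeviceGetStreamPriorityRange(&least, &greatest));

    // Assumption: the communication-related stream is the latency-critical
    // one, so it gets the highest available priority.
    cudaStream_t compute, comm;
    CHECK(cudaStreamCreateWithPriority(&compute, cudaStreamNonBlocking, least));
    CHECK(cudaStreamCreateWithPriority(&comm, cudaStreamNonBlocking, greatest));

    const int n = 1 << 20;
    float *d_bulk = nullptr, *d_halo = nullptr;
    CHECK(cudaMalloc(&d_bulk, n * sizeof(float)));
    CHECK(cudaMalloc(&d_halo, n * sizeof(float)));
    CHECK(cudaMemset(d_bulk, 0, n * sizeof(float)));
    CHECK(cudaMemset(d_halo, 0, n * sizeof(float)));
    CHECK(cudaDeviceSynchronize());  // make sure the buffers are ready

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads, 0, compute>>>(d_bulk, n, 2.0f);
    scale<<<blocks, threads, 0, comm>>>(d_halo, n, 0.5f);
    CHECK(cudaGetLastError());

    // Read back the priorities the runtime actually assigned.
    int comm_prio = 0, compute_prio = 0;
    CHECK(cudaStreamGetPriority(comm, &comm_prio));
    CHECK(cudaStreamGetPriority(compute, &compute_prio));

    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(d_bulk));
    CHECK(cudaFree(d_halo));
    CHECK(cudaStreamDestroy(compute));
    CHECK(cudaStreamDestroy(comm));

    // On devices without priority support both values are equal.
    return (comm_prio <= compute_prio) ? 0 : 1;
}
```

Calling `run_priority_demo()` from `main` and comparing the resulting timeline against a run with default-priority streams is one way to see whether priorities change the overlap at all for a given workload.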

In any case, using multiple streams and events to signal dependencies is the correct way to model things. The scheduling at runtime depends on the work that has been submitted to the GPU. The block scheduler is free to execute any active block on any SM and there is no programmatic way to influence this further.
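To make the streams-plus-events pattern concrete, the following is a minimal sketch under some assumptions: a hypothetical `fill` kernel stands in for the real boundary and interior kernels, and a device-to-host copy stands in for the communication step. The copy stream waits only on the event recorded after the boundary kernel, so the interior kernel is free to overlap with the transfer. Whether it actually overlaps is still up to the hardware scheduler.

```cuda
#include <cuda_runtime.h>

// Leave the enclosing function with 1 on any CUDA error.
#define CHECK(call) do { if ((call) != cudaSuccess) return 1; } while (0)

// Hypothetical placeholder kernel: writes `value` into the first n entries.
__global__ void fill(float* data, int n, float value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Returns 0 on success, 1 on failure.
int run_event_demo() {
    const int n = 1 << 20;   // total field size (illustrative)
    const int halo = 1024;   // boundary region to be "communicated"
    const int threads = 256;

    float *d_field = nullptr, *h_halo = nullptr;
    CHECK(cudaMalloc(&d_field, n * sizeof(float)));
    // Pinned host memory is needed for the copy to run asynchronously.
    CHECK(cudaMallocHost(&h_halo, halo * sizeof(float)));

    cudaStream_t compute, comm;
    cudaEvent_t boundary_done;
    CHECK(cudaStreamCreate(&compute));
    CHECK(cudaStreamCreate(&comm));
    CHECK(cudaEventCreateWithFlags(&boundary_done, cudaEventDisableTiming));

    // 1. Boundary kernel first, then record an event right behind it.
    fill<<<(halo + threads - 1) / threads, threads, 0, compute>>>(d_field, halo, 1.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(boundary_done, compute));

    // 2. The copy stream depends only on the boundary kernel.
    CHECK(cudaStreamWaitEvent(comm, boundary_done, 0));
    CHECK(cudaMemcpyAsync(h_halo, d_field, halo * sizeof(float),
                          cudaMemcpyDeviceToHost, comm));

    // 3. Interior kernel in the compute stream; it touches a disjoint
    //    region, so it may run concurrently with the copy.
    const int interior = n - halo;
    fill<<<(interior + threads - 1) / threads, threads, 0, compute>>>(
        d_field + halo, interior, 2.0f);
    CHECK(cudaGetLastError());

    CHECK(cudaDeviceSynchronize());
    const bool ok = (h_halo[0] == 1.0f) && (h_halo[halo - 1] == 1.0f);

    CHECK(cudaEventDestroy(boundary_done));
    CHECK(cudaStreamDestroy(compute));
    CHECK(cudaStreamDestroy(comm));
    CHECK(cudaFreeHost(h_halo));
    CHECK(cudaFree(d_field));
    return ok ? 0 : 1;
}
```

Issuing the boundary work before the interior work is a design choice in this sketch: the dependency graph would be the same in either order, but the submission order determines what the GPU sees first, which is one reason equivalent orderings can produce slightly different timelines.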

Your case might be more subtle though. If you’re seeing timing differences that don’t make sense, you could post another thread in the CUDA subforum with some profiling data attached.

Thanks Markus. I will look into priorities first to see if that helps with my scheduling. If I am still having a good amount of variability I’ll post the issue on the forums.

Thanks again.