The Peak-Performance-Percentage Analysis Method for Optimizing Any GPU Workload

Originally published at:

Figuring out how to reduce the GPU frame time of a rendering application on PC is challenging for even the most experienced PC game developers. In this blog post, we describe a performance triage method we’ve been using internally at NVIDIA to figure out the main performance limiters of any given GPU workload (also known as perf marker or call range), using NVIDIA-specific hardware metrics.

Will it work with Unity and Unreal Engine?


My top SOLs look like this: VRAM 86%, L2 48%, SM 35%, TEX 22%.
What does this mean? I tried to read through the article, and there is not a lot of information about the case where the top SOL is VRAM :/

Could you please help? Thanks!

In this case, your current GPU workload is mainly limited by the throughput of the VRAM (GDDR5 or GDDR5X memory).
As written in the blog post: "If the top SOL unit is the VRAM, and its SOL% value is not poor (>60%), then this workload is VRAM-throughput limited and merging it with another pass should speedup the frame. A typical example is merging a gamma-correction pass with another post-processing pass."
=> I guess your workload is not just a simple copy pass, but something more complex. What are you doing in this workload? SSR?
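
As a rough illustration of the rule quoted above, here is a minimal Python sketch that picks the top SOL unit from a set of SOL% values and applies the ">60%" check. The function names and the dictionary layout are hypothetical and not part of any NVIDIA tool; only the 60% threshold comes from the blog post.

```python
def top_sol(sol_percentages):
    """Return (unit, value) for the unit with the highest SOL%."""
    return max(sol_percentages.items(), key=lambda kv: kv[1])


def is_vram_throughput_limited(sol_percentages, threshold=60.0):
    """Apply the quoted rule: top SOL unit is VRAM and its SOL% is above 60%."""
    unit, value = top_sol(sol_percentages)
    return unit == "VRAM" and value > threshold


# SOL% values reported in the question above.
reported = {"VRAM": 86.0, "L2": 48.0, "SM": 35.0, "TEX": 22.0}

print(top_sol(reported))                     # ('VRAM', 86.0)
print(is_vram_throughput_limited(reported))  # True
```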

To speed up VRAM-throughput-limited workloads, you should try to understand what is causing your current VRAM traffic.
Looking at TEX & L2 hit rates can help:
- To reduce the VRAM traffic, you can try to increase your L2 hit rate (which can be done by reducing the working set size of your GPU workload, for instance sampling a half-res texture instead of a full one).
- To reduce the traffic that reaches the L2 in the first place, you can try to increase your TEX/L1 hit rate.
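
To make the relationship between the two hit rates and VRAM traffic concrete, here is a simplified first-order model in Python. It assumes that every TEX/L1 miss becomes an L2 request and every L2 miss becomes a VRAM read; the request volume and hit rates are made-up numbers, and real hardware has other sources of L2 and VRAM traffic (writes, compression, other clients) that this ignores.

```python
def miss_traffic(request_bytes, hit_rate):
    """Bytes forwarded to the next memory level: requests that miss the cache."""
    return request_bytes * (1.0 - hit_rate)


def vram_bytes(tex_request_bytes, tex_hit_rate, l2_hit_rate):
    """First-order estimate of VRAM reads caused by texture requests."""
    l2_requests = miss_traffic(tex_request_bytes, tex_hit_rate)
    return miss_traffic(l2_requests, l2_hit_rate)


# Hypothetical workload: 1000 MB of texture requests per frame.
baseline = vram_bytes(1000.0, tex_hit_rate=0.5, l2_hit_rate=0.75)
better_l2 = vram_bytes(1000.0, tex_hit_rate=0.5, l2_hit_rate=0.875)
better_tex = vram_bytes(1000.0, tex_hit_rate=0.75, l2_hit_rate=0.75)

print(baseline, better_l2, better_tex)  # estimated MB read from VRAM in each case
```

In this toy model, improving either hit rate halves the estimated VRAM reads, which is why both are worth looking at.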
See the second section of this talk for some strategies to attack both hit rates for screen-space sampling algorithms:

If you can replace a texture/buffer format with another that takes less space in VRAM, that should reduce the number of bytes transferred from VRAM, which should improve performance when the VRAM SOL% is high, as it is here.
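
As a back-of-the-envelope sketch of the format and resolution suggestions, the Python below computes the uncompressed footprint of a texture for a given size and bytes per texel. The 1920x1080 resolution and the two formats (RGBA16F at 8 bytes per texel, R11G11B10F at 4 bytes per texel) are illustrative assumptions, and the arithmetic ignores mipmaps, padding, and any hardware compression.

```python
def footprint_bytes(width, height, bytes_per_texel):
    """Uncompressed size of one mip level: width * height * bytes per texel."""
    return width * height * bytes_per_texel


# Assumed example: a full-resolution 1080p render target.
full_rgba16f = footprint_bytes(1920, 1080, 8)      # RGBA16F, 8 bytes/texel
full_r11g11b10f = footprint_bytes(1920, 1080, 4)   # R11G11B10F, 4 bytes/texel
half_rgba16f = footprint_bytes(960, 540, 8)        # half-res, same format

print(full_rgba16f)     # 16588800
print(full_r11g11b10f)  # 8294400
print(half_rgba16f)     # 4147200
```

Halving the bytes per texel halves the footprint, and halving the resolution in each dimension quarters it, so a pass that reads the whole texture once has proportionally fewer bytes to fetch.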

Can you please post a screenshot of your Range Profiler output, along with the GPU product name you're running on?

Can you explain the difference between the short scoreboard and long scoreboard warp stall reasons? I could not find any information on what a MIO operation is.

I am analyzing a game, and the Graphics/Compute Idle is always 100%.
The top SOLs are: L2 1.6%, VRAM 1.3%, SM 1.3%, TEX 1.2%, CROP 0.8%.
SM Active is 4.3%

However, when I use nvidia-smi, the GPU utilization is 37%.

Why is that? Is there something wrong with the tool, or am I using it incorrectly?