BOLT is a binary optimization tool developed by Facebook. Google has also developed a corresponding tool called Propeller. Does NVIDIA have any plans to develop similar tools? Is there any significant performance improvement to be had on NVIDIA hardware?
History indicates that NVIDIA does not discuss future product plans in public settings, but is generally responsive to feature requests from users. If there is additional functionality you would like to see, consider filing a feature request via the bug reporting form.
What do these two tools do, and what are you hoping to accomplish? Do BOLT and Propeller actually offer equivalent functionality? I know nothing about BOLT, and my very rudimentary understanding is that Propeller offers profile-guided optimization (PGO), something that the classic Intel compiler (now discontinued) offered for many years on x86 platforms.
As a user of the Intel compilers since the mid-1990s, my experience is that this worked, but that the achievable performance improvements were modest and got smaller as more sophisticated CPU architectures were deployed, usually not more than a single-digit percentage increase. The new clang-based Intel compilers do not support PGO (yet?), or at least did not the last time I checked.
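For reference, here is a minimal sketch of what the instrumented PGO flow looks like in stock clang; the flags shown are the standard LLVM ones (whether icx accepts the same flags is exactly the open question), and the program is a toy of my own:

```cpp
// pgo_demo.cpp -- toy program to illustrate clang's instrumented PGO flow.
// Build/run steps (standard LLVM flags):
//   clang++ -O2 -fprofile-instr-generate pgo_demo.cpp -o pgo_demo
//   LLVM_PROFILE_FILE=pgo.profraw ./pgo_demo
//   llvm-profdata merge -o pgo.profdata pgo.profraw
//   clang++ -O2 -fprofile-instr-use=pgo.profdata pgo_demo.cpp -o pgo_demo_opt
#include <cstdio>
#include <cstdlib>

void process(int v) {
    if (v < 0) {                         // never taken during the training run,
        fprintf(stderr, "bad value\n");  // so PGO treats this branch as cold
        exit(EXIT_FAILURE);              // and moves it out of the hot path
    }
    printf("%d\n", v * 2);               // hot path becomes straight-line code
}

int main() {
    for (int i = 0; i < 1000; ++i)
        process(i);
    return 0;
}
```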
Speaking of binary optimization, I don’t think there’s currently enough opportunity for binary layout optimization for GPU code. The reasons are the following:
- Executable code size: CPU binaries can have hundreds of megabytes of code, primarily driven by a) the general-purpose nature of the code, b) code reuse/external dependencies, and c) performance optimizations such as static linking and aggressive inlining. My understanding is that GPU code consists primarily of compute kernels that are fixed-function and bounded in size. With a smaller code footprint, there is less opportunity for code layout optimization.
- Execution model: a CPU core fetches only one thread at a time, so if the workload is frontend-bound, a reduction in stalls directly translates to speedup. GPUs employ massively multithreaded execution with thousands of active threads whose execution is overlapped, so one stalled thread delays only itself and not the others.
But: reason #1 might be getting less relevant, since kernels can contain branching (although it’s discouraged), they can be generated by the compiler from models, and GPUs can execute multiple kernels simultaneously, all of which adds to the total code footprint. And reason #2 doesn’t mean that GPU code fetch will never become a bottleneck for some workloads at some point.
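As a toy illustration of the branching point, consider a hypothetical kernel like the one below: both branch bodies occupy instruction memory whether or not warps ever diverge into the rare one, so branchy kernels grow the total code footprint the fetch path has to cover:

```cpp
// Hypothetical CUDA kernel: the rare path is resident code regardless of
// how often it executes, adding to the total instruction footprint.
__global__ void clamp_or_fixup(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i];
    if (v <= 1.0f) {
        data[i] = v;                            // common path
    } else {
        data[i] = log1pf(v - 1.0f) + 1.0f;      // rare path, still resident
    }
}
```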
Lastly, there are incentives: I think it’s not accidental that BOLT/Propeller were developed by hyperscalers rather than HW vendors. The latter may still be interested in SW optimizations as a differentiating feature or sales enabler though.
Layout optimizations can be applied to data as well, and given that GPUs have a cache hierarchy, TLBs, etc., just like CPUs, such optimizations may well apply to GPUs.
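The classic example on the data side is array-of-structs versus struct-of-arrays; the sketch below (a toy of my own, not taken from either tool) shows the kind of layout change that turns strided accesses into coalesced ones:

```cpp
// Data layout on GPUs: with an array-of-structs, adjacent threads load
// floats 16 bytes apart and waste memory bandwidth; with a struct-of-arrays
// the same accesses coalesce into contiguous cache lines.
struct ParticleAoS { float x, y, z, w; };

__global__ void scale_x_aos(ParticleAoS* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= 2.0f;   // strided: one float per 16-byte struct
}

struct ParticleSoA { float *x, *y, *z, *w; };

__global__ void scale_x_soa(ParticleSoA p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= 2.0f;   // coalesced: consecutive floats per warp
}
```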
Instead of inquiring about and/or requesting “binary optimization tools” from NVIDIA, I think it would be more productive to narrow this down to support for specific techniques in the CUDA toolchain that have been shown to result in useful performance increments in real-life use cases, or at least in focused research.
That is the reason I asked what specific techniques BOLT / Propeller apply, and I wonder what kind of integration with profilers this requires. I find it telling that in the multi-year transition (2021 to 2024) from their home-grown compilers to clang-based compilers, Intel apparently chose not to port PGO support to the latter framework, suggesting that (1) any performance gains to be had may be marginal, and (2) profiler integration may be difficult.
To provide some context: BOLT and Propeller primarily do profile-guided code layout: function splitting, and function and block ordering. I assume these were the specific techniques requested by the OP.
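For readers unfamiliar with the terminology, the snippet below is a rough manual analog of function splitting, using standard GCC/Clang attributes on a toy example; BOLT and Propeller derive the same hot/cold decisions automatically from profiles and apply them at (post-)link time rather than in source:

```cpp
#include <cstdio>
#include <cstdlib>

// Manual analog of hot/cold splitting: keep the rarely executed handler
// out of line so the hot caller stays small and instruction-cache dense.
__attribute__((cold, noinline))
static void report_and_abort(int code) {
    fprintf(stderr, "fatal error %d\n", code);
    abort();
}

int checked_divide(int a, int b) {
    if (b == 0)                  // profile says: almost never taken
        report_and_abort(1);     // cold code placed in a separate region
    return a / b;                // hot path remains straight-line
}
```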
I find it telling that in the multi-year transition (2021 to 2024) from their home-grown compilers to clang-based compilers Intel apparently chose not to port PGO support to the latter framework, suggesting that (1) any performance gains to be had may be marginal (2) profiler integration may be difficult.
Clang/LLVM have mature implementations of PGO; I would expect that to be the main reason. The performance effect of PGO varies by workload, with double-digit percentage speedups not uncommon for workloads with a large code footprint. I tested the effect of PGO and BOLT with Clang itself as the workload, and found their total IPC improvement comparable to a generational uarch uplift. Profiling is a separate story, but AutoFDO/CSSPGO/BOLT all work with sampled profiles, making them feasible in production environments.
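For the curious, here is an outline of that sampled-profile flow; the tool names are the standard AutoFDO/BOLT ones, but exact options vary by version, so treat this as a sketch rather than a recipe:

```cpp
// sampled_pgo_flow.cpp -- outline of the sampled-profile pipelines above.
//
//   # 1. Sample a production-like run with LBR; no instrumented build needed
//   perf record -e cycles:u -b -- ./prog
//
//   # 2a. AutoFDO: convert samples, feed them back into the compiler
//   create_llvm_prof --binary=./prog --profile=perf.data --out=prog.prof
//   clang++ -O2 -fprofile-sample-use=prog.prof prog.cpp -o prog
//
//   # 2b. BOLT: convert samples, then reoptimize the linked binary post-link
//   perf2bolt ./prog -p perf.data -o prog.fdata
//   llvm-bolt ./prog -o prog.bolt --data=prog.fdata \
//       --reorder-blocks=ext-tsp --split-functions
int main() { return 0; }  // placeholder; the pipeline in the comments is the point
```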
If that is so, I agree that this would be of very limited utility for CUDA device code. Where I could imagine a positive performance impact from the use of profiler data is in the handling of branches, e.g. straight-line flow for the most common case, or avoiding branch conversion (if-conversion to predicated code) where one side of the branch is heavily favored.
I guess I need to look more closely at the clang-derived icx compiler again, because it seemed to me that it had no support for PGO at all when I investigated this a year ago. I am sure Intel had good reasons for the technology switch, but as a long-time (and, for most of that time, paying) user of the Intel compilers, I was a bit miffed that the new compilers have an inferior feature set and more rough edges compared to the “Intel classic” compilers, in particular regarding the precise control of floating-point computation.
the use of profiler data is in the handling of branches, e.g. straight-line flow for the most common case
Prioritizing fall-throughs is actually part of block ordering. But I don’t think post-link optimizations are necessary for device code, where compiler PGO might provide all of the benefit. Code layout algorithms are unified across the compiler, linker, and BOLT in LLVM (CDSort for function ordering, ext-tsp for block ordering), and the performance difference comes from more accurate profile mapping.
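For device code specifically, the closest thing available today is a static branch hint, so that the common case compiles to straight-line fall-through code; a minimal sketch (recent CUDA toolkits document __builtin_expect for device code, though I have not measured the effect):

```cpp
// Sketch: steering block layout in device code with a static branch hint.
// With the hint, the compiler can emit the common case as fall-through and
// keep the rare fixup out of the hot instruction stream.
__global__ void reciprocal(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i];
    if (__builtin_expect(v == 0.0f, 0)) {  // hinted cold: rare zero fixup
        v = 1e-30f;
    }
    data[i] = 1.0f / v;                    // hot, straight-line path
}
```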