BOLT is a binary optimization tool developed by Facebook. Google has also developed a corresponding tool called Propeller. Does NVIDIA have any plans to develop similar tools? Is there any significant performance improvement to be had on NVIDIA hardware?
History indicates that NVIDIA does not discuss future product plans in public settings, but is generally responsive to feature requests from users. If there is additional functionality you would like to see, consider filing a feature request via the bug reporting form.
What do these two tools do, and what are you hoping to accomplish? Do BOLT and Propeller actually offer equivalent functionality? I know nothing about BOLT, and my very rudimentary understanding is that Propeller offers profile-guided optimization (PGO), something that the classic Intel compiler (now discontinued) offered for many years on x86 platforms.
As a user of the Intel compilers since the mid-1990s, my experience is that this worked, but that the achievable performance improvements were modest and got smaller as more sophisticated CPU architectures were deployed, usually not more than a single-digit percentage increase. The new clang-based Intel compilers do not support PGO (yet?), or at least did not the last time I checked.
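For reference, here is a minimal sketch of what the instrumented PGO flow looks like in stock clang; the flags shown are the standard LLVM ones (whether icx accepts the same flags is exactly the open question), and the program is a toy of my own:

```cpp
// pgo_demo.cpp -- toy program to illustrate clang's instrumented PGO flow.
// Build/run steps (standard LLVM flags):
//   clang++ -O2 -fprofile-instr-generate pgo_demo.cpp -o pgo_demo
//   LLVM_PROFILE_FILE=pgo.profraw ./pgo_demo
//   llvm-profdata merge -o pgo.profdata pgo.profraw
//   clang++ -O2 -fprofile-instr-use=pgo.profdata pgo_demo.cpp -o pgo_demo_opt
#include <cstdio>
#include <cstdlib>

void process(int v) {
    if (v < 0) {                         // never taken during the training run,
        fprintf(stderr, "bad value\n");  // so PGO treats this branch as cold
        exit(EXIT_FAILURE);              // and moves it out of the hot path
    }
    printf("%d\n", v * 2);               // hot path becomes straight-line code
}

int main() {
    for (int i = 0; i < 1000; ++i)
        process(i);
    return 0;
}
```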
Speaking of binary optimization, I don’t think there’s currently enough opportunity for binary layout optimization for GPU code. The reasons are the following:
- Executable code size: CPU binaries can have hundreds of megabytes of code, primarily driven by a) the general-purpose nature of the code, b) code reuse/external dependencies, and c) performance optimizations such as static linking and aggressive inlining. My understanding is that GPU code consists primarily of compute kernels that are fixed-function and bounded in size. With a smaller code footprint, there is less opportunity for code layout optimization.
- Execution model: a CPU core fetches only one thread at a time, so if the workload is frontend-bound, a reduction in stalls directly translates to speedup. GPUs employ massively multithreaded execution with thousands of active threads whose execution is overlapped, so one stalled thread delays only itself and not the others.
But: reason #1 might be getting less relevant, since kernels can contain branching (although it’s discouraged), they can be generated by the compiler from models, and GPUs can execute multiple kernels simultaneously, all of which adds to the total code footprint. And reason #2 doesn’t mean that GPU code fetch will never become a bottleneck for some workloads at some point.
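As a toy illustration of the branching point, consider a hypothetical kernel like the one below: both branch bodies occupy instruction memory whether or not warps ever diverge into the rare one, so branchy kernels grow the total code footprint the fetch path has to cover:

```cpp
// Hypothetical CUDA kernel: the rare path is resident code regardless of
// how often it executes, adding to the total instruction footprint.
__global__ void clamp_or_fixup(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i];
    if (v <= 1.0f) {
        data[i] = v;                            // common path
    } else {
        data[i] = log1pf(v - 1.0f) + 1.0f;      // rare path, still resident
    }
}
```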
Lastly, there are incentives: I think it’s not accidental that BOLT/Propeller were developed by hyperscalers rather than HW vendors. The latter may still be interested in SW optimizations as a differentiating feature or sales enabler though.
Layout optimizations can be applied to data as well, and given that GPUs have a cache hierarchy, TLBs, etc., just like CPUs, such optimizations may well apply to GPUs.
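The classic example on the data side is array-of-structs versus struct-of-arrays; the sketch below (a toy of my own, not taken from either tool) shows the kind of layout change that turns strided accesses into coalesced ones:

```cpp
// Data layout on GPUs: with an array-of-structs, adjacent threads load
// floats 16 bytes apart and waste memory bandwidth; with a struct-of-arrays
// the same accesses coalesce into contiguous cache lines.
struct ParticleAoS { float x, y, z, w; };

__global__ void scale_x_aos(ParticleAoS* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= 2.0f;   // strided: one float per 16-byte struct
}

struct ParticleSoA { float *x, *y, *z, *w; };

__global__ void scale_x_soa(ParticleSoA p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= 2.0f;   // coalesced: consecutive floats per warp
}
```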
Instead of inquiring about and/or requesting “binary optimization tools” from NVIDIA, I think it would be more productive to narrow this down to support for specific techniques in the CUDA toolchain that have been shown to result in useful performance increments in real-life use cases, or at least in focused research.
That is the reason I asked what specific techniques BOLT / Propeller apply, and I wonder what kind of integration with profilers this requires. I find it telling that in the multi-year transition (2021 to 2024) from their home-grown compilers to clang-based compilers, Intel apparently chose not to port PGO support to the latter framework, suggesting that (1) any performance gains to be had may be marginal, and (2) profiler integration may be difficult.
To provide some context: BOLT and Propeller primarily do profile-guided code layout: function splitting, and function and block ordering. I assume these were the specific techniques requested by the OP.
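For readers unfamiliar with the terminology, the snippet below is a rough manual analog of function splitting, using standard GCC/Clang attributes on a toy example; BOLT and Propeller derive the same hot/cold decisions automatically from profiles and apply them at (post-)link time rather than in source:

```cpp
#include <cstdio>
#include <cstdlib>

// Manual analog of hot/cold splitting: keep the rarely executed handler
// out of line so the hot caller stays small and instruction-cache dense.
__attribute__((cold, noinline))
static void report_and_abort(int code) {
    fprintf(stderr, "fatal error %d\n", code);
    abort();
}

int checked_divide(int a, int b) {
    if (b == 0)                  // profile says: almost never taken
        report_and_abort(1);     // cold code placed in a separate region
    return a / b;                // hot path remains straight-line
}
```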
I find it telling that in the multi-year transition (2021 to 2024) from their home-grown compilers to clang-based compilers Intel apparently chose not to port PGO support to the latter framework, suggesting that (1) any performance gains to be had may be marginal (2) profiler integration may be difficult.
Clang/LLVM have mature implementations of PGO; I would expect that to be the main reason. The performance effect of PGO varies by workload, with double-digit percentage speedups not uncommon for workloads with a large code footprint. I tested the effect of PGO and BOLT with Clang itself as the workload, and found their total IPC improvement comparable to a generational uarch uplift. Profiling is a separate story, but AutoFDO/CSSPGO/BOLT all work with sampled profiles, making them feasible in production environments.
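For the curious, here is an outline of that sampled-profile flow; the tool names are the standard AutoFDO/BOLT ones, but exact options vary by version, so treat this as a sketch rather than a recipe:

```cpp
// sampled_pgo_flow.cpp -- outline of the sampled-profile pipelines above.
//
//   # 1. Sample a production-like run with LBR; no instrumented build needed
//   perf record -e cycles:u -b -- ./prog
//
//   # 2a. AutoFDO: convert samples, feed them back into the compiler
//   create_llvm_prof --binary=./prog --profile=perf.data --out=prog.prof
//   clang++ -O2 -fprofile-sample-use=prog.prof prog.cpp -o prog
//
//   # 2b. BOLT: convert samples, then reoptimize the linked binary post-link
//   perf2bolt ./prog -p perf.data -o prog.fdata
//   llvm-bolt ./prog -o prog.bolt --data=prog.fdata \
//       --reorder-blocks=ext-tsp --split-functions
int main() { return 0; }  // placeholder; the pipeline in the comments is the point
```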
If that is so, I agree that this would be of very limited utility for CUDA device code. Where I could imagine a positive performance impact from the use of profiler data is in the handling of branches, e.g. straight-line flow for the most common case, or avoiding branch conversion (if-conversion to predicated code) where one side of the branch is heavily favored.
I guess I need to look more closely at the clang-derived icx compiler again, because it seemed to me that it had no support for PGO at all when I investigated this a year ago. I am sure Intel had good reasons for the technology switch, but as a long-time (and, for most of that time, paying) user of the Intel compilers, I was a bit miffed that the new compilers have an inferior feature set and more rough edges compared to the “Intel classic” compilers, in particular regarding the precise control of floating-point computation.
the use of profiler data is in the handling of branches, e.g. straight-line flow for the most common case
Prioritizing fall-throughs is actually part of block ordering. But I don’t think post-link optimizations are necessary for device code, where compiler PGO might provide all of the benefit. Code layout algorithms are unified across the compiler, linker, and BOLT in LLVM (CDSort for function ordering, ext-tsp for block ordering), and the performance difference comes from more accurate profile mapping.
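For device code specifically, the closest thing available today is a static branch hint, so that the common case compiles to straight-line fall-through code; a minimal sketch (recent CUDA toolkits document __builtin_expect for device code, though I have not measured the effect):

```cpp
// Sketch: steering block layout in device code with a static branch hint.
// With the hint, the compiler can emit the common case as fall-through and
// keep the rare fixup out of the hot instruction stream.
__global__ void reciprocal(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i];
    if (__builtin_expect(v == 0.0f, 0)) {  // hinted cold: rare zero fixup
        v = 1e-30f;
    }
    data[i] = 1.0f / v;                    // hot, straight-line path
}
```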